Abstract

A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely 
at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient 
communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two 
models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and 
the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number 
of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower 
bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic 
programming (DP) and entropy coding, and achieve rates that are approximately optimal. 

I. Introduction 

As the paradigm of cloud computing becomes pervasive, storing and transmitting files and their edited versions consumes a 
huge amount of resources (storage, bandwidth, computation) in client-datacentre channels, and intra-datacentre traffic. Industrial 
projections ID predict that the size of the digital universe will expand exponentially to 40 zettabytes (ZB) in 2020. By then, nearly
40% of information will be “touched” by cloud computing JT).

If a file is “lightly edited”, storing and transmitting the entire new file from clients to servers wastes a significant amount 
of space and bandwidth. Scenarios in which the number of edits is a small fraction of the original file are very common in 
real-life editing behaviour. For example, data-backup systems such as Dropbox and Time Machine keep regular snapshots of 
users’ files. In revision-control software such as CVS, Git and Mercurial, users (programmers) are likely to periodically commit 
and store their code after a small number of edits. Currently, many online-backup services use delta encoding (also known
as delta compression), and only upload the edited pieces of files 0-g). However, to the best of our knowledge, no existing
techniques provide information-theoretically optimal compression guarantees, and indeed this is the primary contribution of 
our work. 

There are potentially many other types of edits besides symbol insertions and deletions (for instance block insertions/deletions,
substitutions, transpositions, copy-paste, crop, etc.; these and other edit models have been considered in, among other
works, 0-E2). Since these other edit models are in general a combination of symbol insertions and deletions, we focus on
the “base case” of symbol insertions and deletions.¹

A. Our work/contributions 

In this work, we study the problem of one-way communication of file updates to a data-centre. The client (henceforth called 
the encoder) has a file X (henceforth called the pre-edit source sequence) drawn from some distribution, and edits it according 
to some process - we shortly describe both the source and the edit process in more detail - to generate the new file Y. 
The encoder has both the old file X and the edited version of the file Y.² The encoder transmits a function of X, Y to the
data-centre (henceforth called the decoder). The pre-edit source sequence X is available at the decoder as side-information.
The goal of communication is for the decoder to reconstruct Y. A “good” communication scheme manages to achieve this
while requiring minimal communication from the encoder to the decoder.³

We now discuss the pre-edit source sequence, and the edit process. There are many possible combinations of different 
pre-edit source sequence processes, and edit processes. Some of those that have been studied in the literature include: arbitrary 
input processes HD, CD, random input processes GO), (partial) permutations 0, duplications C2; random edit processes HD-ED, Markov edit processes ED-

In this work, we consider two models. In the Random Pre-Edit Sequence, Left-to-Right Random InDel (RPES-LtRRID) 
process, a file is modeled as a sequence of symbols drawn i.i.d. uniformly at random from an alphabet A. The new file is 
obtained from the old file through a left-to-right random InDel process, which is modeled as a Markov chain with three states: the
“insert symbol” state, the “delete symbol” state, and the “no-operation” state. Roughly speaking, these three states correspond 

1 A caveat here - as is common in the literature, we characterize the compression performance of our file update scheme in terms of the number of symbols 
inserted and deleted. However, explicitly modeling other common user operations can lead to different schemes and possibly better compression performance 
in practice. 

2 The encoder may actually ALSO have access to the actual edit process, but as we shall see this doesn’t necessarily help in our problem. 

3 Several authors have considered the “interactive communication” version of the problem, in which the encoder and decoder communicate in multiple
rounds. While this is an interesting problem in its own right, we choose to focus on the relatively less explored one-way communication problem, since as we
show, there is little throughput penalty with such a restriction.
Table 1: (Related work) The content of each column is as follows.
(1) Two aspects of each communication model are shown here. The first aspect concerns what information is available to which party. Depending on the specific model considered, either the original file (the pre-edit source sequence) X, or the new file (the post-edit source sequence) Y, or both may be available at the encoder and the decoder. The second aspect considered is whether interactive/two-way transmissions between the encoder and decoder are allowed, or only the encoder is allowed to transmit (one-way communication).
(2) The size of the source alphabet: 2 denotes a binary source alphabet, and $|\mathcal{A}|$ denotes a general alphabet.
(3) ‘Arb’ represents an arbitrary (“worst-case”) pre-edit source sequence; ‘Ran’ represents pre-edit sequences drawn i.i.d. from the alphabet.
(4) ‘Arb’ represents the positions and contents of the edits being arbitrary; ‘Ran’ represents random positions and contents of edits; ‘Markov’ represents the edit process being a Markov chain.
(5) Here ‘Ins’, ‘Del’ and ‘Sub’ respectively represent insertion, deletion and substitution edit operations.
(6) Upper bounds on the number of edits in each work, as a function of n (length of the pre-edit source sequence X).
(7) Whether an explicit information-theoretic lower bound is presented, where ‘Y’ and ‘N’ stand for ‘Yes’ and ‘No’ respectively, and ‘-’ for the case where the number of edits is o(n) or within a factor of order-optimal lower bounds in some two-way communication models.
(8) Whether the algorithm is deterministic (‘D’) or random (‘R’).
(9) The complexity of the algorithm, as a function of n (length of the pre-edit source sequence X).
(10) Whether the algorithm has “small” error ($\epsilon$-error) or zero error.
(11) The number of bits transmitted. In our notation, $\epsilon$ stands for the fraction (of n) of insertions, and $\delta$ for the fraction of deletions. In 0, GU-GE the fractions of insertions and deletions vanish with n, hence the corresponding variables are denoted $\epsilon_n$ and $\delta_n$.
(12) This column has additional remarks on specific works.


to the cursor moving “from left to right”, and at each point, either a uniformly random symbol is inserted, the symbol at
the cursor is deleted, or the cursor jumps ahead without changing the previous symbol. This model attempts to capture a
“one-pass/streaming” edit process.⁴

We also study an Arbitrary Pre-Edit Sequence, Arbitrary InDel (APES-AID) process. In this model, the old file is modeled 
as an arbitrary sequence over an arbitrary alphabet A. The post-edit source sequence Y is generated from the pre-edit source 
sequence X through an arbitrary/“worst-case” InDel process - we require that the number of edit operations is at most a 
small (but possibly constant) fraction of the file length n. The sequence of edits (insertions and deletions) is arbitrary up to an 
upper bound on the total number, occurs in arbitrary positions, and inserts arbitrary symbols from A for edits corresponding 
to insertions. Both these models are described formally in Section II-B.

In both our models, we consider arbitrary alphabet sizes. We first prove information-theoretic lower bounds on the 
compression rate needed so that the decoder is able to reconstruct Y for both models. To do so we build non-trivially on recent 
work on the deletion channel d in the random pre-edit sequence/edit model (see Theorem 0, and provide a combinatorial 
argument in the arbitrary pre-edit source/edit model (see Theorem 9). We then design “universal” computationally-efficient
achievability schemes based on dynamic programming (DP) and entropy coding (see Theorems 10 and 11). The compression
rate achieved by the DP scheme is an explicitly computable additive term away from the lower bound for almost all alphabet
sizes⁵ and numbers of edits. In the regime wherein the number of edits is a small (but possibly constant) fraction of the length
of X and the alphabet size is large, this term is small (details in Section IV-B).

4 More general/realistic sources/Markov edit-processes are the subject of our ongoing research. 

5 In the random source/edit model, we actually have no restriction on the alphabet size; in the arbitrary source/edit model, for technical reasons, our bounds
hold only for alphabets of size at least 3.
B. Related work

Various models of the file-synchronization problem have been considered in the literature; see Table 1 for a summary.
Our work here differs from each of those works in significant ways. For instance, in our model the encoder knows both files,
hence we design one-way communication protocols (rather than the multi-round protocols required in the models where the
encoder and the decoder each has one version of the file as in 01, 0, 0, CI1-II3); hence our protocols are information-theoretically
near-optimal (for the two-way communication model, however, even computationally efficient schemes that achieve rates
within constant factors of the lower bounds are already challenging). The one-way communication model studied in [10], [14] is
the closest to our RPES-LtRRID model. For the information-theoretic lower bound, we differ from H3 by considering both
insertions and deletions, and arbitrary alphabets. The achievability scheme in [10] matches the lower bound up to first-order
terms for the random source/edit model, whereas our scheme is “universal” for both the RPES-LtRRID and APES-AID models.
The literature on insertion/deletion channels and error-correcting codes is also quite closely related; indeed, we
borrow significantly from techniques in [16], [19], [20].

There are two lines of related work. In the file synchronization problem, the encoder knows X and the decoder knows Y. The
purpose is to let the decoder learn X (the encoder may or may not learn Y) through communication (either two-way or
one-way). In our file update problem, the encoder knows both X and Y, and the decoder knows X. The purpose is to let the decoder
learn Y by one-way communication. In 0, an interactive synchronization algorithm was introduced which corrects o(n)
random insertions, deletions and substitutions over the binary alphabet, where n represents the file size. This is adapted from their
previous work [11], which corrects o(n/log n) insertions and deletions. Their algorithm was used as a component in [12], where
the synchronization algorithm corrects a small constant fraction of deletions over the binary alphabet, and in [13], wherein the
algorithm synchronizes insertions and deletions under a non-binary non-uniform source. A one-way file synchronization model
with Markov deletions over the binary alphabet was studied in [14], in which the optimal rate was characterized by an
information-theoretic expression. In [10], a one-way file synchronization algorithm was introduced (with both versions available at the encoder)
that synchronizes random insertions, deletions and substitutions over the binary alphabet.

In the insertion/deletion channel problem, the channel model can be the same as our InDel process (there are many
different ways to model the stochastic insertions/deletions in both problems), but the purposes are different. In insertion/deletion
channels, one needs to choose the input distribution to attain the channel capacity $\max_{p(X)} I(X; Y) = \max_{p(X)} [H(Y) - H(Y|X)]$.
In the file update problem, the input distribution is given (arbitrary and random in this paper). The purpose is to
find the minimum amount of information that Enc needs to send to Dec, $\min_{p(Y|X)} H(Y|X)$, where the probability $p(Y|X)$ is
determined by the InDel process.


II. Model 

A. Notational Convention 

In this work, our notational conventions are as follows. We denote scalars by lowercase nonboldface nonitalic symbols such 
as c. We use uppercase nonboldface symbols such as X to denote random variables, and lowercase nonboldface symbols such as 
x to denote instantiations of those random variables. We denote vectors (sequences) of random variables or their instantiations 
by boldface symbols, for example, X and x are vectors of random variable X and its instantiations x respectively. We also 
denote matrices by uppercase boldface symbols. For example, an m by n matrix is denoted by $M_{m \times n}$, and when there is no
ambiguity we abbreviate it by dropping the dimensions, such as M. An n by n identity matrix is denoted by $I_n$. We denote
sets by calligraphic symbols, such as $\mathcal{S}$. The length of a vector X is denoted by |X|. The cardinality of a set $\mathcal{S}$ is denoted by
$|\mathcal{S}|$. We denote the standard binary entropy by $H(\cdot)$, that is, $H(p) = -p \log p - (1-p) \log (1-p)$. All logarithms are binary.
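
As a small illustration of the entropy convention above, the following sketch evaluates the binary entropy function with base-2 logarithms. The function name `binary_entropy` is ours (hypothetical), and the convention H(0) = H(1) = 0 is the usual limiting value rather than something stated in the text.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) = -p log p - (1-p) log(1-p), in bits.

    The endpoint values H(0) = H(1) = 0 follow the usual limiting convention
    (an assumption; the text does not spell it out).
    """
    if p < 0.0 or p > 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        print(p, binary_entropy(p))
```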

B. Edit Process 

1) Random Pre-Edit Sequence Left-to-Right Random InDel (RPES-LtRRID) Process: As noted in the introduction, many
different stochastic models for source sequences and edit processes have been considered in the literature. In this work, we
study an RPES-LtRRID process as shown in Fig. 1, which is motivated by the Markov deletion model in [14]. It is an i.i.d.
insertion-deletion process, a special case of a more general left-to-right Markov InDel process as shown in Fig. 2. Our results
should in general translate to other stochastic models as well in the regime wherein there are a small number of insertions
and deletions, but for the sake of concreteness we focus on the i.i.d. left-to-right random InDel process.

• Pre-edit source sequence (PreESS): The source initially has a pre-edit source sequence $\bar{X} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_n)$, a length-n
sequence of symbols drawn i.i.d. uniformly at random from the source alphabet $\mathcal{A} = \{0, \ldots, |\mathcal{A}| - 1\}$. Finally, we append
an end-of-file symbol $\bar{X}_{n+1} = \text{eof}$ to the end of $\bar{X}$. We denote the distribution of the pre-edit source sequence by $p(\bar{X})$.

• InDel process: As shown in Fig. 1, the InDel process is a Markov chain with three states, defined as follows:

- the “insertion state” $\bar{\iota}$: insert (write) a symbol uniformly drawn from $\mathcal{A}$;

- the “deletion state” $\bar{\Delta}$: read one symbol rightwards in the pre-edit source sequence $\bar{X}$, and delete the symbol;

- the “no-operation state” $\bar{\eta}$: read one symbol rightwards in the pre-edit source sequence $\bar{X}$, and do nothing.
Fig. 1: Left-to-Right Random InDel (LtRRID) process: Starting in front of the first symbol of $\bar{X}$, at each step, the process inserts a symbol
uniformly drawn from $\mathcal{A}$ with probability $\epsilon$, reads one symbol rightwards and deletes it with probability $\delta$, or reads one symbol rightwards and
does nothing with probability $1 - \epsilon - \delta$. Note that an inserted symbol is never deleted in this process. In contrast, a deleted symbol might
be inserted back right away, with probability $\epsilon \frac{1}{|\mathcal{A}|}$. The process stops when it reaches the end of file $\bar{X}_{n+1} = \text{eof}$.


The edit process starts in front of $\bar{X}_1$ and ends when it reaches the end of file $\bar{X}_{n+1} = \text{eof}$. This means that in our
model, the total number of deletions plus no-operations equals exactly n. In addition there are a potentially unbounded
number of insertions (though in our model the expected number of insertions is bounded).⁶ The numbers of deletions and
insertions are random variables $K_D$ and $K_I$ respectively. We describe the edit pattern of the InDel process by a pair of
sequences $\bar{E} = (\bar{O}^{n+K_I}, \bar{C}^{K_I})$, where the edit operation pattern $\bar{O}^{n+K_I} \in \{\bar{\iota}, \bar{\Delta}, \bar{\eta}\}^{n+K_I}$ records the insertion ($\bar{\iota}$), deletion ($\bar{\Delta}$) and no-operation ($\bar{\eta}$) steps, and the insertion content
is $\bar{C}^{K_I} \in \mathcal{A}^{K_I}$. The random $(\epsilon, \delta)$-InDel process is an i.i.d. insertion-deletion process with $P(\bar{\iota}) = \epsilon$, $P(\bar{\Delta}) = \delta$, and
$P(\bar{\eta}) = 1 - \epsilon - \delta$.

• Post-edit source sequence (PosESS): The post-edit source sequence $\bar{Y} = \bar{Y}(\bar{X}, \bar{E})$ is a sequence obtained from $\bar{X}$ through the InDel process $\bar{E} = (\bar{O}^{n+K_I}, \bar{C}^{K_I})$.

• Post-edit set: Given any PreESS $\bar{X}$, any PosESS $\bar{Y}$ in $\mathcal{A}^*$ (any sequence over $\mathcal{A}$ of any length) might be in its post-edit
set, albeit with possibly “very small” probability. In fact, for any $\bar{X}$ and $\bar{Y}$, there may be multiple edit patterns that
generate $\bar{Y}$ from $\bar{X}$. We use $p(\bar{Y}|\bar{X})$ to denote the probability that the random left-to-right InDel process
generates $\bar{Y}$ from $\bar{X}$ (via any edit pattern).

• Runs: We use the usual definition (see, for example Ell) of a run being a maximal block of contiguous identical symbols. 
Since we shall be interested in runs of several different sequences, to avoid confusion about the parent sequence we use 
S -run to denote a run in a sequence S. 
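
The definitions above can be sketched as a short simulation. This is only an illustration under our own naming (`simulate_ltrrid`, the string labels for the three states, and the use of Python's `random.Random` are assumptions, not part of the model description). The loop stops when a non-insertion step would read the eof symbol, so insertions after the last symbol are allowed, and the number of deletions plus no-operations is exactly n.

```python
import random

# Hypothetical labels for the three states of the InDel process.
INSERT, DELETE, NOOP = "ins", "del", "nop"


def simulate_ltrrid(x, eps, delta, alphabet_size, rng):
    """Run one left-to-right random (eps, delta)-InDel pass over x.

    Returns (y, ops, contents): the post-edit sequence, the edit operation
    pattern, and the inserted symbols in order of insertion.
    """
    if eps < 0 or delta < 0 or eps >= 1 or eps + delta > 1:
        raise ValueError("need 0 <= eps < 1, delta >= 0 and eps + delta <= 1")
    y, ops, contents = [], [], []
    pos = 0  # the cursor sits in front of x[pos]
    while True:
        u = rng.random()
        if u < eps:
            # Insertion state: write a uniform symbol; the cursor stays put.
            symbol = rng.randrange(alphabet_size)
            y.append(symbol)
            ops.append(INSERT)
            contents.append(symbol)
            continue
        if pos == len(x):
            break  # the next symbol read is eof, so the process stops
        if u < eps + delta:
            ops.append(DELETE)  # deletion state: read one symbol and drop it
        else:
            ops.append(NOOP)  # no-operation state: read one symbol and keep it
            y.append(x[pos])
        pos += 1
    return y, ops, contents


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.randrange(4) for _ in range(20)]
    y, ops, contents = simulate_ltrrid(x, 0.1, 0.1, 4, rng)
    print(x)
    print(y)
    print(ops.count(INSERT), ops.count(DELETE))
```

The returned `ops` has length n + K_I and `contents` has length K_I, mirroring the pair of sequences that describes the edit pattern.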



Fig. 2: General Left-to-Right Markov InDel (GLtRMID) process: a general three-state Markov Chain where transitions between any of the 
three states can happen with general probabilities. This results in an InDel process with unit memory. However, the block lengths of insertions 
and deletions are still geometrically distributed. This model is a subject of our ongoing research. 


2) Arbitrary Pre-Edit Sequence Arbitrary InDel (APES-AID) Process: 

• Pre-edit source sequence (PreESS): The source initially has a pre-edit source sequence $\mathbf{X} = (X_1, X_2, \ldots, X_n)$, an arbitrary length-n sequence in $\mathcal{A}^n$.

• InDel process: The InDel process consists of a sequence of arbitrary InDel edits $E = (E_1, E_2, \ldots, E_k)$, where k denotes
the number of edits. For notational convenience we also use $\mathbf{X}_0$ to denote $\mathbf{X}$, and $\mathbf{X}_j$ to denote the sequence obtained
from $\mathbf{X}_0$ after the first j edits $(E_1, \ldots, E_j)$ for all $j = 1, 2, \ldots, k$. An arbitrary InDel edit $E_j = (P_j, O_j, C_j)$ consists
of three parameters:


6 Note that in our model a symbol that is inserted cannot be deleted, since the “cursor” moves on after inserting a symbol. This is just one of many possible 
stochastic InDel processes - we choose to work with this model since it makes notation more convenient - we believe similar results can be obtained for a 
variety of related stochastic InDel processes. 
- the position of the cursor $P_j \in \{0, 1, 2, \ldots, |\mathbf{X}_{j-1}|\}$, which ranges over the positions between symbols (including in front of
the first symbol and behind the last symbol) in the current sequence $\mathbf{X}_{j-1}$;

- the edit operation $O_j \in \{\iota, \Delta\}$, where $\iota$ indicates that the edit operation is inserting at the cursor position, and $\Delta$
indicates that the edit operation is deleting the symbol in front of the cursor (when $P_j = 0$, the edit operation can
only be an insertion, that is, $O_j = \iota$);

- the content of insertion $C_j \in \mathcal{A} \cup \{\text{nop}\}$, which is an arbitrary symbol from $\mathcal{A}$ if the edit operation is an insertion,
and “nop” if the edit operation is a deletion.

The sequence obtained from $\mathbf{X}_{j-1}$ after the jth arbitrary InDel edit $E_j$ is a function of $\mathbf{X}_{j-1}$ and $E_j$, and is denoted by
$\mathbf{X}_j = \mathbf{X}_j(\mathbf{X}_{j-1}, E_j)$. The edit process defined as above is an arbitrary InDel process. If the edit process is subject to the
constraint that there are at most $\epsilon n$ insertions and $\delta n$ deletions, it is called an arbitrary $(\epsilon, \delta)$-InDel process. (Since the
sequence length keeps changing, for clarity, the parameters are with respect to the length of the pre-edit source sequence.)
Two special cases are the arbitrary $\epsilon$-insertion process (equivalently an arbitrary $(\epsilon, 0)$-InDel process), and the arbitrary
$\delta$-deletion process (equivalently an arbitrary $(0, \delta)$-InDel process).

• Post-edit source sequence (PosESS): A post-edit source sequence, denoted by $Y = Y(X, E)$, is the sequence obtained
from X through an arbitrary InDel process $E = (E_1, \ldots, E_k)$. If the InDel process is subject to an $(\epsilon, \delta)$-constraint, the
post-edit source sequence is called an $(\epsilon, \delta)$-post-edit source sequence.

• $(X, (\epsilon, \delta))$-post-edit set: Let $\mathcal{Y}_{\epsilon,\delta}(X)$ denote the $(X, (\epsilon, \delta))$-post-edit set, that is, the set of all sequences over $\mathcal{A}$ that may be
obtained from X via the arbitrary $(\epsilon, \delta)$-InDel process.

• Runs: As in the RPES-LtRRID model, a run is a maximal block of contiguous identical symbols.

Remark: Note that in the APES-AID process, the order of insertions and deletions in the edit process is in general arbitrary.
However, based on the following Fact 1, we can simplify the model by separating the insertions and deletions.

Fact 1. An arbitrary $(\epsilon, \delta)$-InDel process can be separated into an arbitrary $\delta$-deletion process followed by an arbitrary
$\epsilon$-insertion process.

The proof of Fact 1 is provided in Appendix B.
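
To make the edit semantics above concrete, here is a minimal sketch that applies a sequence of arbitrary InDel edits to a list of symbols. The names `apply_edit` and `apply_edits`, the string labels for the two operations, and the use of `None` for “nop” are our own assumptions. A deletion removes the symbol immediately in front of the cursor, so it is rejected at position 0.

```python
def apply_edit(x, edit):
    """Apply one arbitrary InDel edit (position, operation, content) to x."""
    position, operation, content = edit
    if not 0 <= position <= len(x):
        raise ValueError("cursor position out of range")
    if operation == "ins":
        if content is None:
            raise ValueError("an insertion needs a symbol")
        # Insert at the cursor position.
        return x[:position] + [content] + x[position:]
    if operation == "del":
        if position == 0:
            raise ValueError("no symbol in front of the cursor at position 0")
        if content is not None:
            raise ValueError("a deletion carries no content (nop)")
        # Delete the symbol in front of the cursor.
        return x[:position - 1] + x[position:]
    raise ValueError("unknown edit operation")


def apply_edits(x, edits):
    """Apply the edits in order, returning the post-edit sequence."""
    current = list(x)
    for edit in edits:
        current = apply_edit(current, edit)
    return current


if __name__ == "__main__":
    x = [0, 1, 2, 3]
    # Insertion first, then deletion ...
    print(apply_edits(x, [(0, "ins", 9), (3, "del", None)]))
    # ... and an equivalent deletion-then-insertion ordering.
    print(apply_edits(x, [(2, "del", None), (0, "ins", 9)]))
```

The two orderings in the usage example produce the same post-edit sequence, which is one small instance of the separation stated in Fact 1 (an illustration only, not a proof).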

C. Communication Model 

The communication system is as shown in Fig. 3. We define the communication model for both the RPES-LtRRID process and
the APES-AID process. For clarity, we state the model for the RPES-LtRRID process, and repeat it for the APES-AID process using
notation without bars.


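[Figure 3: block diagram of the communication model, showing the InDel process, the sequences X and Y, the encoder Enc, and the decoder output Y'.]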


Fig. 3: Communication model: The source has both the random PreESS $\bar{X}$ and the random PosESS $\bar{Y}$, as discussed in Section II-B1. The
sequence $\bar{Y}$ is obtained from $\bar{X}$ through the random $(\epsilon, \delta)$-InDel process discussed in Section II-B1. The source encodes the source sequences
$(\bar{X}, \bar{Y})$ into a transmission $\mathrm{Enc}(\bar{X}, \bar{Y})$ and sends it to the decoder through a noiseless channel. The PreESS $\bar{X}$ is available at the
decoder as side-information. The decoder receives $\mathrm{Enc}(\bar{X}, \bar{Y})$, and regenerates the PosESS as $\bar{Y}'$ from $(\mathrm{Enc}(\bar{X}, \bar{Y}), \bar{X})$. Here the bar
is used to denote the fact that the source sequences and edit process are as described in Section II-B1 rather than Section II-B2.
The communication model for the APES-AID model discussed in Section II-B2 is similar, except that the quantities $\{\bar{X}, \bar{Y}, \mathrm{Enc}(\bar{X}, \bar{Y}), \bar{Y}'\}$
are replaced with $\{X, Y, \mathrm{Enc}(X, Y), Y'\}$.


In the RPES-LtRRID process model, the source has both the PreESS $\bar{X}$ and the PosESS $\bar{Y}$. The PosESS $\bar{Y}$ is obtained from
the PreESS $\bar{X}$ through a random $(\epsilon, \delta)$-InDel process. The PreESS $\bar{X}$ and PosESS $\bar{Y}$ are encoded using an encoder Enc, whose
output $\mathrm{Enc}(\bar{X}, \bar{Y})$ may be any non-negative integer. Taking as inputs the transmission $\mathrm{Enc}(\bar{X}, \bar{Y})$ and the PreESS $\bar{X}$, the
decoder Dec reconstructs the PosESS $\bar{Y}$ as $\bar{Y}'$. The code $\bar{\mathcal{C}}_{\epsilon,\delta}$ comprises the encoder-decoder pair (Enc, Dec). The average rate
$\bar{R}$ of the code $\bar{\mathcal{C}}_{\epsilon,\delta}$ is the average number of bits transmitted by the encoder, defined as $\sum_{\bar{X} \in \mathcal{A}^n, \bar{Y} \in \mathcal{A}^*} p(\bar{X}, \bar{Y}) \log |\mathrm{Enc}(\bar{X}, \bar{Y})|$.
A code $\bar{\mathcal{C}}_{\epsilon,\delta}$ is “$(1 - P_e)$-good” if the average probability of error, defined as $\Pr_{\bar{X} \in \mathcal{A}^n, \bar{Y} \in \mathcal{A}^*}\{(\bar{X}, \bar{Y}) : \mathrm{Dec}(\mathrm{Enc}(\bar{X}, \bar{Y}), \bar{X}) \neq \bar{Y}\}$,
is less than $P_e$. A rate $\bar{R}_{\epsilon,\delta}$ is said to be achievable on average if for any $P_e > 0$ there is a code for sufficiently large
n that is $(1 - P_e)$-good. The infimum (over all n and corresponding $\bar{\mathcal{C}}_{\epsilon,\delta}$) of all achievable rates is called the
optimal average transmission rate, and is denoted $\bar{R}^*_{\epsilon,\delta}$.

In the APES-AID process model, the source has both the PreESS X and the PosESS Y. The PosESS Y is obtained from
the PreESS X through an arbitrary $(\epsilon, \delta)$-InDel process. The PreESS X and PosESS Y are encoded using an encoder Enc
into a transmission Enc(X, Y) from the set $\{1, 2, \ldots, 2^{nR}\}$, where R denotes the rate of the encoder Enc. Taking as inputs
the transmission Enc(X, Y) and the PreESS X, the decoder Dec reconstructs the PosESS Y as Y'. The code $\mathcal{C}_{\epsilon,\delta}$ comprises
the encoder-decoder pair (Enc, Dec). A code $\mathcal{C}_{\epsilon,\delta}$ is said to be “good” if for every X in $\mathcal{A}^n$ and Y in the $(X, (\epsilon, \delta))$-post-edit
set, the decoder outputs the correct PosESS, i.e. Y' = Y. A rate $R_{\epsilon,\delta}$ is said to be achievable if for sufficiently large n there
exists a good code with rate at most $R_{\epsilon,\delta}$. The infimum (over all n and corresponding $\mathcal{C}_{\epsilon,\delta}$) of all achievable rates is called
the optimal transmission rate, and is denoted $R^*_{\epsilon,\delta}$.

Remark: For the APES-AID process, we require zero error for the source code, because we can achieve this stringent
requirement without paying a penalty in our optimal achievable rate. In contrast, we allow “small” error in the RPES-LtRRID
process, because it is necessary to allow for “atypical” source sequences and edit patterns.
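
The one-way, zero-error setting can be illustrated with a toy encoder-decoder pair. This is a minimal sketch and not the scheme analyzed in this paper: it uses Python's standard `difflib.SequenceMatcher` to find an edit script (it is neither the DP of our achievability scheme nor entropy coded, so it makes no claim about rate), and the names `encode_update` and `decode_update` are hypothetical. The encoder sees both X and Y; the decoder sees only the script and X.

```python
from difflib import SequenceMatcher


def encode_update(x, y):
    """Encoder: describe y relative to x as a list of (i1, i2, inserted).

    Each entry says: replace x[i1:i2] by the symbols in `inserted`.
    Unchanged stretches of x are not transmitted.
    """
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            script.append((i1, i2, list(y[j1:j2])))
    return script


def decode_update(script, x):
    """Decoder: rebuild y from the script and the side-information x."""
    out = []
    cursor = 0
    for i1, i2, inserted in script:
        out.extend(x[cursor:i1])  # copy the untouched stretch of x
        out.extend(inserted)      # write inserted or replacement symbols
        cursor = i2               # skip the deleted or replaced stretch
    out.extend(x[cursor:])
    return out


if __name__ == "__main__":
    x = list("the quick brown fox")
    y = list("the quack brown foxes")
    script = encode_update(x, y)
    print(script)
    print("".join(decode_update(script, x)))
```

Because the decoder replays the script on exactly the X that the encoder used, reconstruction is always exact, which matches the zero-error requirement of the APES-AID model; how many bits the script costs is a separate question that this sketch does not address.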

III. Lower Bound 

A. RPES-LtRRID Process 

1) Proof Roadmap: Since the decoder already has access to the PreESS $\bar{X}$, the entropy of $\mathrm{Enc}(\bar{X}, \bar{Y})$ merely needs to equal
$H(\bar{Y}|\bar{X})$, the conditional entropy of the entire PosESS given the PreESS (see the details in Lemma 2). The challenge is to
characterize this conditional entropy in single-letter/computable form, rather than as a “complicated” function of n; indeed the
same challenge is faced in providing information-theoretic converses for any problems in which information is processed and/or
communicated. For scenarios when the relationship from $\bar{X}$ to $\bar{Y}$ corresponds to a memoryless channel, standard techniques
often apply; unfortunately, this is not the case in our file update problem. We follow the lead of 02D, which noted that for
InDel processes that are independent of the sequence being edited (as in our case), characterizing $H(\bar{Y}|\bar{X})$ is equivalent to
characterizing $H(\bar{E}|\bar{X}, \bar{Y})$. (Recall that $\bar{E}$ denotes the random variable corresponding to the edit pattern.) In fact $H(\bar{Y}|\bar{X})$
can be written as $H(\bar{E}) - H(\bar{E}|\bar{X}, \bar{Y})$. This is because of the aforementioned independence between $\bar{E}$ and $\bar{X}$, and the fact
that $\bar{Y}$ is a deterministic function of $\bar{X}$ and $\bar{E}$. We argue this formally in Lemma 3. The entropy of the edit patterns $H(\bar{E})$
exactly equals the entropy of specifying the locations of deletions, and insertions and their contents (this is argued formally
in Lemma 4 below).⁷ Since multiple edit patterns can take a PreESS $\bar{X}$ to a PosESS $\bar{Y}$, the term $H(\bar{E}|\bar{X}, \bar{Y})$ corresponds to
the uncertainty in the edit pattern given both $\bar{X}$ and $\bar{Y}$. The intuition is that disambiguating this uncertainty is useless for the
problem of file updating, hence this quantity is called “nature’s secret” in Ca. For instance, given $\bar{X}$ = 00000 and $\bar{Y}$ = 000,
the decoder doesn’t know, nor does it need to know, which specific pattern of two deletions converted $\bar{X}$ to $\bar{Y}$; all the encoder
needs to communicate to the decoder is that there were two deletions. In general, if a symbol is deleted from a run or the
same symbol generating a run is inserted in the run (edits that shorten or lengthen runs in $\bar{X}$), the encoder doesn’t need to
specify to the decoder the exact locations of deletions or insertions in $\bar{X}$-runs.
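
The “nature’s secret” example can be checked by brute force. The sketch below (the function name `deletion_outcomes` is ours, and it covers deletions only, for illustration) enumerates every way of deleting k symbols from a short string and counts how many deletion patterns lead to each outcome. Outcomes reached by many patterns are exactly those for which the pattern itself need not be communicated.

```python
from collections import Counter
from itertools import combinations


def deletion_outcomes(x, k):
    """Map each post-edit string to the number of k-deletion patterns producing it."""
    counts = Counter()
    for deleted in combinations(range(len(x)), k):
        removed = set(deleted)
        y = "".join(s for i, s in enumerate(x) if i not in removed)
        counts[y] += 1
    return counts


if __name__ == "__main__":
    # All ways of deleting two symbols from a single run give the same outcome.
    print(deletion_outcomes("00000", 2))
    # With alternating symbols, the patterns spread over several outcomes.
    print(deletion_outcomes("01010", 2))
```

For the single run 00000, every choice of two deleted positions yields 000, so the identity of the deleted positions carries no information about the new file.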

However, characterizing $H(\bar{E}|\bar{X}, \bar{Y})$ is still a non-trivial task, since it corresponds to an entropic quantity of “long sequences
with memory”. One challenge is that it is hard to align $\bar{X}$-runs and $\bar{Y}$-runs. In other words, it is in general difficult to tell
which run/runs in $\bar{X}$ lead to a run in $\bar{Y}$ (we call this run/runs in $\bar{X}$ the parent run/runs of the run in $\bar{Y}$ ED). We develop the
approach in [16]:

• We first carefully “perturb” the original edit pattern $\bar{E}$ to a typicalized edit pattern $\hat{E}$ (described in detail below).

• We compute the typicalized PosESS $\hat{Y}$ corresponding to operating the typicalized edit pattern $\hat{E}$ on the PreESS $\bar{X}$.

• We show via non-trivial case analysis and Lemma 5 that with a “small amount” ($O(\max(\epsilon, \delta)^2 n)$ bits) of additional
information, $\bar{X}$ and $\hat{Y}$ can be aligned.

• We show two implications of the above alignment: Lemma 6 provides a bound on $H(\hat{E}|\bar{X}, \hat{Y})$, and Lemma 7 shows that
$H(\hat{E}|\bar{X}, \hat{Y})$ is “close” to $H(\bar{E}|\bar{X}, \bar{Y})$.

Pulling together the implications of the steps above enables us to characterize $H(\bar{Y}|\bar{X})$, up to “first order in $\epsilon$ and $\delta$”. We
summarize the steps of our proof in Fig. 4.


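[Figure 4: flowchart of the proof. Lemma 2 (Fano's) gives $n\bar{R}_{\epsilon,\delta} \geq H(\bar{Y}|\bar{X})$; Lemma 3 gives $H(\bar{Y}|\bar{X}) = H(\bar{E}) - H(\bar{E}|\bar{X}, \bar{Y})$; $H(\bar{E})$ is computed in Lemma 4.]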

Fig. 4: Flowchart of the proof: The natural lower bound on the amount of information that the encoder needs to send to the decoder is given
by the conditional entropy $H(\bar{Y}|\bar{X})$, which we show in Lemma 3 equals the amount of information needed to describe the edit pattern, $H(\bar{E})$,
minus an amount called “nature’s secret”, $H(\bar{E}|\bar{X}, \bar{Y})$. We characterize $H(\bar{E})$ in Lemma 4. To characterize nature’s secret $H(\bar{E}|\bar{X}, \bar{Y})$,
we perturb the edit pattern $\bar{E}$ to a “typicalized” edit pattern $\hat{E}$. We show in Lemma 7 that nature’s secret $H(\bar{E}|\bar{X}, \bar{Y})$ is within at most an
order $O(\max(\epsilon, \delta)^2)$ distance from the “typicalized nature’s secret” $H(\hat{E}|\bar{X}, \hat{Y})$, which we characterize in Lemma 6.

One major difference between our work and the analysis in [16] is that since we consider both insertions and deletions, our
case-analysis is significantly more intricate. Another difference is that we explicitly characterize our bounds for sequences over 

7 Recall that in our left-to-right InDel model a symbol that is inserted will not be deleted. Even in other models, the reduction in the entropy of E due to
interaction of insertions and deletions would be a multiplicative factor of $\epsilon \times \delta$, which is a “higher-order/smaller” term than the terms we focus on in this
work, in the regime of small $\epsilon, \delta$.


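[Figure 4, continued: $H(\bar{E}) - H(\bar{E}|\bar{X}, \bar{Y}) = H(\bar{E}) - H(\hat{E}|\bar{X}, \hat{Y}) + H(\hat{E}|\bar{X}, \hat{Y}) - H(\bar{E}|\bar{X}, \bar{Y})$, where $H(\hat{E}|\bar{X}, \hat{Y})$ is bounded in Lemma 6 and the difference $H(\hat{E}|\bar{X}, \hat{Y}) - H(\bar{E}|\bar{X}, \bar{Y})$ is bounded in Lemma 7.]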
all (finite) alphabet sizes, whereas [16] concerned itself only with binary sequences. Also, besides the difference in models and
techniques, the underlying motivation differs. The authors of [16] focused on characterizing the capacity of deletion channels
(and hence they could choose arbitrary subsets of PreESS). On the other hand, we focus on the file update problem (and hence
our “channel input” PreESS $\bar{X}$ is drawn according to source statistics).

2) Proof Details: Recall that in the InDel model (described in Section II-B1), the total number of deletions and no-operations
equals n. Conditioned on an edit not being an insertion, the edit is a deletion with probability $\frac{\delta}{1-\epsilon}$ and a no-operation
with probability $\frac{1-\epsilon-\delta}{1-\epsilon}$. Hence, the total number of deletions $K_D$ follows a binomial distribution $B(n, \frac{\delta}{1-\epsilon})$ with
mean $\frac{\delta}{1-\epsilon} n$. Recall that in our model we allow insertions in front of the first symbol and after the last symbol; this is the
reason why the number of insertions $K_I$ is parametrized by (n + 1) rather than n in the following. The distribution
of the number of insertions at the beginning of the InDel process and after each deletion or no-operation is $\mathrm{Geo}_0(1-\epsilon)$, the
geometric distribution on the support {0, 1, 2, ...} with parameter $(1-\epsilon)$ [22]. The InDel process stops when the total
number of deletions and no-operations is n. Hence, $K_I$ is the sum of n + 1 i.i.d. random variables whose distributions follow
$\mathrm{Geo}_0(1-\epsilon)$. Equivalently, $K_I$ is the number of insertions (each occurring with probability $\epsilon$) until n + 1 deletions/no-operations occur,
which follows a negative binomial distribution $NB(n+1; \epsilon)$ with mean $(n+1)\frac{\epsilon}{1-\epsilon}$ [22].
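
The means quoted above can be sanity-checked numerically. The following sketch (all function names are ours, and the Monte Carlo sampler is only an illustration of the stated distributions, not part of the proof) compares the closed-form means of $K_D$ and $K_I$ with empirical averages, and also evaluates the expected post-edit length $n - E[K_D] + E[K_I]$.

```python
import random


def expected_deletions(n, eps, delta):
    """Mean of the binomial B(n, delta / (1 - eps))."""
    return n * delta / (1.0 - eps)


def expected_insertions(n, eps):
    """Mean of a sum of n + 1 i.i.d. Geo_0(1 - eps) variables."""
    return (n + 1) * eps / (1.0 - eps)


def expected_output_length(n, eps, delta):
    """Expected post-edit length n - E[K_D] + E[K_I]."""
    return n - expected_deletions(n, eps, delta) + expected_insertions(n, eps)


def sample_counts(n, eps, delta, rng):
    """Draw one (K_D, K_I) pair from the distributions described above."""
    p_del = delta / (1.0 - eps)
    k_d = sum(1 for _ in range(n) if rng.random() < p_del)
    k_i = 0
    for _ in range(n + 1):
        # Geo_0(1 - eps): count insertions before the first non-insertion.
        while rng.random() < eps:
            k_i += 1
    return k_d, k_i


if __name__ == "__main__":
    n, eps, delta = 100, 0.1, 0.2
    rng = random.Random(1)
    trials = 2000
    samples = [sample_counts(n, eps, delta, rng) for _ in range(trials)]
    print(expected_deletions(n, eps, delta), sum(d for d, _ in samples) / trials)
    print(expected_insertions(n, eps), sum(i for _, i in samples) / trials)
    print(expected_output_length(n, eps, delta))
```

Note that the expected length simplifies to $\frac{1-\delta}{1-\epsilon} n + \frac{\epsilon}{1-\epsilon}$, the same quantity that multiplies $\log |\mathcal{A}|$ in the entropy of the post-edit sequence.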

Throughout this section, because we deal with sequences with random lengths, we use Theorem 3 in [23] multiple times.
Hence we restate the theorem here as a preliminary for our later proofs.

Theorem 1. i23l/ [Theorem 3 (Determined Stopping Tune)] A stopping time N is said to be a determined stopping time for 
the i.i.d. sequence Xi, X 2 , ■ ■ ■ if {N = n) £ ct(Xi, X 2 ,..., X n ) for all n = 1,2,..., where cr(Xi, X 2 , ..., X n ) is the a-field 
generated by X \, X 2 ,..., X n . Then, for a determined stopping time N, 

H(X N ) = ElNjHiXi), (1) 

where X N £ A* denotes the randomly stopped sequence. 
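
As a sanity check of Equation (1), consider a simple illustrative case that is not taken from the paper: $X_i$ i.i.d. Bernoulli($p$) and $N$ the index of the first 1, so that $\{N = n\}$ is determined by $X_1,\dots,X_n$ and $E[N] = 1/p$. The sketch below compares a truncated numerical evaluation of $H(X^N)$ with $E[N]H(X_1)$. The truncation length is an arbitrary assumption.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_stopped_at_first_one(p, max_len=5000):
    """H(X^N) in bits when X_i ~ Bernoulli(p) i.i.d. and N is the index of the first 1.

    The stopped sequence is 0...01 with k-1 zeros, which has probability
    (1-p)^(k-1) * p; the infinite sum is truncated at max_len terms.
    """
    total = 0.0
    for k in range(1, max_len + 1):
        q = (1 - p) ** (k - 1) * p
        if q > 0.0:  # skip terms that underflow to zero
            total -= q * math.log2(q)
    return total


def expected_length_times_entropy(p):
    """E[N] * H(X_1) with E[N] = 1/p for the first-success stopping time."""
    return (1.0 / p) * binary_entropy(p)
```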

Lemma 2 (Converse). For the Random Pre-Edit Sequence Left-to-Right Random InDel (RPES-LtRRID) process, the achievable rate $R_{\epsilon,\delta}$ is at least $\frac{1}{n}H(Y|X)$.

Proof: We first show a modified version of the conventional Fano's inequality $H(Y|Y') \le 1 + P_e \log|\mathcal{Y}|$. Because we allow insertions in our model, the length of $Y$ can be arbitrarily large as the block-length $n$ grows without bound. Hence, the upper bound $H(Y|Y', Y' \ne Y) \le \log|\mathcal{Y}|$ used in the proof of the conventional Fano's inequality does not work in our problem. We modify Fano's inequality by bounding this term as $H(Y|Y', Y' \ne Y) \le H(Y)$. The PosESS $Y$ is a sequence of symbols drawn uniformly i.i.d. from $\mathcal{A}$, where its length $(n - K_D + K_I)$ is a “determined stopping time” for the sequence. Hence by Theorem 1,

$$H(Y) = (n - E[K_D] + E[K_I]) \log|\mathcal{A}| = \left(\frac{1-\delta}{1-\epsilon}n + \frac{\epsilon}{1-\epsilon}\right)\log|\mathcal{A}|.$$

Hence, our modified Fano's inequality is

$$H(Y|X, \mathrm{Enc}(X,Y)) \le 1 + P_e\left(\frac{1-\delta}{1-\epsilon}n + \frac{\epsilon}{1-\epsilon}\right)\log|\mathcal{A}| \le n\,o_n, \qquad (2)$$

where $o_n \to 0$ as $n \to \infty$.

We have the following chain of inequalities,

$$\begin{aligned}
nR_{\epsilon,\delta} &\ge H(\mathrm{Enc}(X,Y)) \\
&\ge H(\mathrm{Enc}(X,Y)|X) \\
&= H(Y|X) + H(\mathrm{Enc}(X,Y)|X,Y) - H(Y|X,\mathrm{Enc}(X,Y)) \\
&\overset{(a)}{=} H(Y|X) - H(Y|X,\mathrm{Enc}(X,Y)) \\
&\overset{(b)}{\ge} H(Y|X) - n\,o_n, \qquad (3)
\end{aligned}$$

where equality (a) holds since standard arguments show that randomized encoders do not help. Inequality (b) follows from our modified Fano's inequality as shown in Equation (2).

Dividing both sides of Equation (3) by $n$ yields our converse. □

Lemma 3. The conditional entropy $H(Y|X)$ equals the entropy of the edit pattern $H(E)$, less “nature’s secret” $H(E|X,Y)$, i.e., $H(Y|X) = H(E) - H(E|X,Y)$.

Proof:

$$\begin{aligned}
H(Y|X) &\overset{(a)}{=} H(E|X) + H(Y|X,E) - H(E|X,Y) \\
&\overset{(b)}{=} H(E) + H(Y|X,E) - H(E|X,Y) \\
&\overset{(c)}{=} H(E) - H(E|X,Y),
\end{aligned}$$

where (a) is from the Chain Rule; (b) is because the edits $E$ are independent of the PreESS $X$; and (c) is because the PosESS $Y$ is a deterministic function of $(X, E)$. □
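
The identity in Lemma 3 only uses the independence of $E$ and $X$ and the fact that $Y$ is determined by $(X, E)$, so it can be checked by brute force on a small instance. The sketch below uses a simplified deletion-only toy process (an assumption made purely to keep the enumeration finite; it is not the RPES-LtRRID process itself), with hypothetical helper names.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, keep):
    """Marginalize a joint distribution over tuples onto the coordinates in keep."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in keep)] += p
    return out


def toy_joint(n, alphabet, delta):
    """Joint law of (X, E, Y): X uniform i.i.d., E i.i.d. deletions/no-ops, Y = E applied to X."""
    joint = {}
    for x in itertools.product(alphabet, repeat=n):
        for e in itertools.product("DN", repeat=n):  # D = deletion, N = no-operation
            p = (1 / len(alphabet)) ** n
            for op in e:
                p *= delta if op == "D" else 1 - delta
            y = tuple(s for s, op in zip(x, e) if op == "N")
            joint[(x, e, y)] = p
    return joint


def check_lemma3(n=3, alphabet="01", delta=0.2):
    """Return (H(Y|X), H(E) - H(E|X,Y)) for the toy process."""
    joint = toy_joint(n, alphabet, delta)
    h_x = entropy(marginal(joint, [0]))
    h_e = entropy(marginal(joint, [1]))
    h_xy = entropy(marginal(joint, [0, 2]))
    h_xey = entropy(joint)
    lhs = h_xy - h_x              # H(Y|X)
    rhs = h_e - (h_xey - h_xy)    # H(E) - H(E|X,Y)
    return lhs, rhs
```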

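Lemma 4. $\lim_{n\to\infty}\frac{1}{n}H(E) \ge H(\delta) + H(\epsilon) + \epsilon\log|\mathcal{A}| + 2\min(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2)$.

Proof: Recall that $E = (O^{n+K_I}, C^{K_I})$, where $O^{n+K_I}$ is an i.i.d. sequence with $P(O_1 = \iota) = \epsilon$, $P(O_1 = \Delta) = \delta$ and $P(O_1 = \eta) = 1-\epsilon-\delta$. Hence,

$$\begin{aligned}
H(O_1) &= -\delta\log\delta - \epsilon\log\epsilon - (1-\epsilon-\delta)\log(1-\epsilon-\delta) \\
&= H(\delta) + H(\epsilon) + (1-\delta)\log(1-\delta) + (1-\epsilon)\log(1-\epsilon) - (1-\epsilon-\delta)\log(1-\epsilon-\delta) \\
&\overset{(a)}{=} H(\delta) + H(\epsilon) + (1-\delta)(\log e)\left(-\delta - \tfrac{\delta^2}{2} - O(\delta^3)\right) + (1-\epsilon)(\log e)\left(-\epsilon - \tfrac{\epsilon^2}{2} - O(\epsilon^3)\right) \\
&\qquad - (1-\delta-\epsilon)(\log e)\left[-(\delta+\epsilon) - \tfrac{(\delta+\epsilon)^2}{2} - O((\delta+\epsilon)^3)\right] \\
&= H(\delta) + H(\epsilon) - \epsilon\delta\log e + O(\max(\epsilon,\delta)^3), \qquad (4)
\end{aligned}$$

where step (a) is by Taylor series expansion. Hence,

$$\begin{aligned}
\lim_{n\to\infty}\frac{1}{n}H(E) &= \lim_{n\to\infty}\frac{1}{n}\left[H(O^{n+K_I}) + H(C^{K_I}|O^{n+K_I})\right] \\
&\overset{(a)}{=} \lim_{n\to\infty}\frac{1}{n}\left[(n+E[K_I])H(O_1) + H(C^{K_I}|O^{n+K_I})\right] \\
&\overset{(b)}{=} \lim_{n\to\infty}\frac{1}{n}\left[(n+E[K_I])H(O_1) + H(C^{K_I}|K_I)\right] \\
&= \lim_{n\to\infty}\frac{1}{n}\Big[(n+E[K_I])H(O_1) + \sum_{k=0}^{\infty}H(C^{K_I}|K_I=k)\Pr(K_I=k)\Big] \\
&= \lim_{n\to\infty}\frac{1}{n}\Big[(n+E[K_I])H(O_1) + \sum_{k=0}^{\infty}H(C^{k})\Pr(K_I=k)\Big] \\
&= \lim_{n\to\infty}\frac{1}{n}\Big[(n+E[K_I])H(O_1) + \sum_{k=0}^{\infty}kH(C_1)\Pr(K_I=k)\Big] \\
&= \lim_{n\to\infty}\frac{1}{n}\Big[(n+E[K_I])H(O_1) + H(C_1)\sum_{k=0}^{\infty}k\Pr(K_I=k)\Big] \\
&\overset{(c)}{=} \lim_{n\to\infty}\frac{1}{n}\left[(n+E[K_I])H(O_1) + E[K_I]H(C_1)\right] \\
&\overset{(d)}{=} \lim_{n\to\infty}\frac{1}{n}\left[\frac{n+\epsilon}{1-\epsilon}H(O_1) + (n+1)\frac{\epsilon}{1-\epsilon}\log|\mathcal{A}|\right] \\
&= \frac{1}{1-\epsilon}\left(H(O_1) + \epsilon\log|\mathcal{A}|\right) \\
&\overset{(e)}{=} \frac{1}{1-\epsilon}\left(H(\delta) + H(\epsilon) + \epsilon\log|\mathcal{A}| - \epsilon\delta\log e + O(\max(\epsilon,\delta)^3)\right) \\
&\overset{(f)}{=} H(\delta) + H(\epsilon) + \epsilon\log|\mathcal{A}| - \epsilon\delta\log\delta - \epsilon^2\log\epsilon + (\log e + \log|\mathcal{A}|)\epsilon^2 + O(\max(\epsilon,\delta)^3) \\
&\ge H(\delta) + H(\epsilon) + \epsilon\log|\mathcal{A}| + 2\min(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2),
\end{aligned}$$

where equality (a) is because, by Theorem 3 in [23], $n + K_I$ is a “determined stopping time” for the i.i.d. edit sequence $O_1, O_2, \dots$, hence $H(O^{n+K_I}) = (n + E[K_I])H(O_1)$. Equality (b) is because, given the edit operation sequence $O^{n+K_I}$, the insertion content sequence $C^{K_I}$ depends only on the number of insertions $K_I$.¹ The steps from equality (b) to equality (c) follow by expanding over the values of $K_I$ and noting that $C^{K_I}$ is a sequence of i.i.d. variables. Equality (d) is by Fact ??(a) and noting that the contents of insertions are drawn uniformly from the alphabet. Equality (e) is by Equation (4). Equality (f) is by taking the Taylor series expansion of $H(\delta)$ and $H(\epsilon)$. □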
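
Equation (4) can be checked numerically for small edit rates. The sketch below, with illustrative function names, compares the exact entropy $H(O_1)$ (in bits) with the second-order expression $H(\delta) + H(\epsilon) - \epsilon\delta\log e$; the gap should shrink like the cube of $\max(\epsilon,\delta)$.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_of_operation(eps, delta):
    """Exact H(O_1) for P(insertion)=eps, P(deletion)=delta, P(no-op)=1-eps-delta."""
    probs = [eps, delta, 1 - eps - delta]
    return -sum(p * math.log2(p) for p in probs)


def second_order_approximation(eps, delta):
    """Right-hand side of Equation (4) without the cubic remainder term."""
    return h2(delta) + h2(eps) - eps * delta * math.log2(math.e)
```

For example, with $\epsilon = \delta = 0.01$ the two values differ by an amount on the order of $10^{-6}$ bits.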

As discussed in Section III-A and Fig. 4, the next quantity we need to calculate/bound is the “nature’s secret” $H(E|X,Y)$ of the edit process. However, this quantity is in general difficult to calculate because $X$ and $Y$ are unsynchronized. Hence we perturb the edit process $E$ to a “typicalized edit process” $\bar{E}$, for which an analogue of nature’s secret, $H(\bar{E}|X,\bar{Y})$, can be calculated (see Lemma 6 for details). We now formally define the typicalized edit process $\bar{E}$ and some sequences that depend on $\bar{E}$:

¹Equivalently, $H(C^{K_I}|O^{n+K_I}) = H(C^{K_I}|O^{n+K_I}, K_I) = H(C^{K_I}|K_I) + H(O^{n+K_I}|C^{K_I}, K_I) - H(O^{n+K_I}|K_I) = H(C^{K_I}|K_I)$.

Definition 1 (Typicalized edit process). The typicalized edit pattern $\bar{E}$ is determined from $(X, E)$ by choosing a subset of the edits in the original edit pattern $E$ in the following way. The extended run of a run in $X$ includes the run and its two neighbouring symbols, one on each side. Given $(X, E)$, for all $X$-runs, count the number of edits per extended run.² If there is no more than one edit in the extended run, the edit pattern in this run is set to be the same in the typicalized edit pattern. If there is more than one edit in the extended run, the typicalized edit pattern $\bar{E}$ has no edits in that run, that is, the $X$-run and the corresponding $\bar{Y}$-run are identical.

Remark:

• Whether to eliminate the deletions of neighbouring symbols or not is decided by checking the extended runs of the runs they belong to. For example, for $E: 0\cancel{1}11\cancel{2}23$, there are two edits in the extended run 01112 of the second run 111, hence the edit in this run (the deletion of the left-most 1) is eliminated in $\bar{E}$. The right-neighbour 2 of the run 111 belongs to the third run 22, whose extended run 1223 contains only one edit. Hence, the deletion of the right-neighbour 2 of the run 111 is not eliminated in $\bar{E}$. The typicalized edit pattern in this example is $\bar{E}: 0111\cancel{2}23$.

• An insertion that occurs at the boundary of two runs is contained in the extended runs of both the run at its left and the run at its right. If there is more than one edit in at least one of the extended runs it belongs to, the insertion is eliminated in $\bar{E}$. For example, for $E: 0111\downarrow^{4}22\cancel{3}$, in the extended run 01112 there is only one edit: the insertion of 4 in front of the right-neighbour. However, in the extended run 1223 there are two edits, so the insertion of 4 is eliminated in $\bar{E}$. The last symbol 3 is the right-neighbour of the run 22, hence its deletion is not eliminated in $\bar{E}$. The typicalized edit pattern in this example is $011122\cancel{3}$.
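
The typicalization rule and the two remark examples can be sketched in code. The representation below is an assumption made for illustration and does not come from the paper: deletions are a set of indices of $X$, insertions are a mapping from a gap position $p$ (meaning "in front of $X[p]$", with $p = n$ meaning after the last symbol) to the inserted string, and an insertion at a run boundary is treated as belonging to both adjacent runs.

```python
def runs_of(x):
    """Return the maximal runs of x as inclusive index pairs (start, end)."""
    runs, start = [], 0
    for i in range(1, len(x) + 1):
        if i == len(x) or x[i] != x[start]:
            runs.append((start, i - 1))
            start = i
    return runs


def typicalize(x, deletions, insertions):
    """Drop the edits of every run whose extended run contains more than one edit.

    deletions: set of indices of x that are deleted.
    insertions: dict {gap: inserted string}; gap p is the position in front of x[p].
    Returns the pair (kept deletions, kept insertions).
    """
    n = len(x)
    overloaded = []  # runs whose extended run holds more than one edit
    for a, b in runs_of(x):
        lo, hi = max(a - 1, 0), min(b + 1, n - 1)
        # Deletions anywhere in the extended run count, including both neighbours.
        count = sum(1 for i in deletions if lo <= i <= hi)
        # Insertions count only between the left neighbour and the right neighbour.
        count += sum(len(s) for p, s in insertions.items() if a <= p <= b + 1)
        if count > 1:
            overloaded.append((a, b))
    kept_del = {i for i in deletions
                if not any(a <= i <= b for a, b in overloaded)}
    kept_ins = {p: s for p, s in insertions.items()
                if not any(a <= p <= b + 1 for a, b in overloaded)}
    return kept_del, kept_ins
```

With $X = 0111223$, deleting indices 1 and 4 leaves only the deletion at index 4, and inserting a 4 at gap 4 together with deleting index 6 leaves only the deletion at index 6, matching the two examples in the remark.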

Denote the number of insertions and deletions in the typicalized edit process $\bar{E}$ by $\bar{K}_I$ and $\bar{K}_D$ respectively. Since in our model the way we define edit patterns ensures that the sum of the number of deletions and no-operations in any edit pattern (including typicalized edit patterns) always equals exactly $n$, the length of $\bar{E}$ equals $n + \bar{K}_I$.

Definition 2 (Typicalized PosESS). The typicalized PosESS $\bar{Y}$ is the post-edit source sequence obtained by applying the typicalized edit pattern $\bar{E}$ to the PreESS $X$. The length of $\bar{Y}$ equals $n - \bar{K}_D + \bar{K}_I$.

Definition 3 (Complement of the typicalized edit process). The complement of the typicalized edit process, $E^c = (\underline{O}^{\,n+K_I-\bar{K}_I}, \underline{C}^{\,K_I-\bar{K}_I})$, is defined to specify the eliminated edits, where $\underline{O}^{\,n+K_I-\bar{K}_I} \in \{-, \iota, \Delta\}^{n+K_I-\bar{K}_I}$ specifies the positions and operations of the eliminated edits and $\underline{C}^{\,K_I-\bar{K}_I} \in \mathcal{A}^{K_I-\bar{K}_I}$ specifies the contents of the eliminated insertions.

Fig. 5 shows an example of all the sequences we define above. We will reuse this example multiple times later to explain different concepts. Fig. 6 shows the dependencies of all the sequences we define above, and some internal random variables we define and use in the later proofs.

We first show that $\bar{Y}$-runs can be “mostly” aligned to their parent run(s) in $X$. The intuition is that since $X$-runs undergo at most one edit in the typicalized edit process, for any $\bar{Y}$-run there are only a few possible cases for its parent run(s) and the corresponding length(s). There are only two events in which the cases of the parent run-length intersect, which we call the “ambiguous local alignment” events. An ambiguous local alignment event might be resolved by continuing to align under both possible alignments, until for one alignment no typicalized edit pattern can convert $X$ to $\bar{Y}$. Otherwise, both local alignments are possible and result in different “global alignments”. Hence, one can align $(X, \bar{Y})$ in a left-to-right manner by checking the lengths of $\bar{Y}$-runs and $X$-runs, with the aid of some extra information indicating which global alignment it is. Fig. 8 gives an example where an ambiguous local alignment is resolved by aligning further runs; Fig. 9 gives another example where an ambiguous local alignment is not resolved and hence leads to two possible global alignments. Once $(X, \bar{Y})$ are aligned, the uncertainty of the typicalized edit pattern $\bar{E}$ lies only in the positions of insertions that lengthen runs (insertions of the same symbol as in the run) and of deletions within the runs where they occur.

For a length-$l_Y$ $\bar{Y}$-run, its possible parent run(s) are categorized into the following cases, as shown in Fig. 7 (in all cases we give examples corresponding to the length-$l_Y$ $\bar{Y}$-run being 00000):

• Case 1: The parent run is a “single run” with length $l_X$.

- Case 1.1 (1-parent-0-edit): No edit in the parent run, hence $l_X = l_Y$. E.g.: $00000 \to 00000$.

- Case 1.2 (1-parent-1-ins): One insertion in the parent run, hence $l_X = l_Y - 1$. E.g.: $00\downarrow^{0}00 \to 00000$.

- Case 1.3 (1-parent-1-del): One deletion in the parent run, hence $l_X = l_Y + 1$. E.g.: $\cancel{0}00000 \to 00000$.

²Deletion of any symbol in the extended run (including deletion of either of the two symbols neighbouring the $X$-run) adds one to the count. Insertion of a symbol adds one to the count only if the insertion happens to the right of the left-neighbour of the $X$-run, and to the left of the right-neighbour of the $X$-run. Note that insertions that occur between two runs are therefore counted once in both $X$-runs, since they are in the extended run of each $X$-run.
Fig. 5: Example of the defined file and edit sequences: The first row shows a length $n = 13$ PreESS $X$ over the alphabet $\{0, 1, 2, 3, 4, 5\}$. The second row shows in shorthand the edits performed on $X$. The third row shows the corresponding edit pattern $E$. As defined in the model section, insertions are represented by $\iota$, deletions by $\Delta$, and no-operations by $\eta$. Here, for the sake of brevity we abuse notation by representing the contents of insertions as subscripts to the corresponding $\iota$, rather than as a separate $C^{K_I}$. For instance, in the example in this figure, the operation of inserting a 4 after the fifth symbol is represented by $\iota_4$. Since there are $K_I = 3$ insertions in the edit sequence, the length of the edit sequence $E$ equals $n + 3 = 16$. The resulting PosESS $Y$ is shown in the fourth row. Note that $X$ has 6 runs: 000, 1111, 22, 3, 2 and 33 (single symbols distinct from their neighbours also count as runs). The corresponding extended runs are respectively 0001, 011112, 1223, 232, 323, and 233. The number of edits in each of these runs is therefore respectively 1, 3, 1, 0, 0, 1, and in the corresponding extended runs is 1, 4, 1, 0, 1, 1. Hence the only edits eliminated from $E$ to get $\bar{E}$ are the three edits in the second $X$-run (since the corresponding extended $X$-run has 4 edits and, by our definition, typicalized edit patterns may have at most one edit per extended run). The “complement” of the edit process therefore has blanks ($-$) everywhere except in the locations corresponding to the three edits in the second run of $X$, as shown in the fifth row. The sixth row shows the typicalized edit process (with all the edit operations present in $E$, except those corresponding to the three in the second run of $X$). Finally, the last row shows the typicalized PosESS $\bar{Y}$ resulting from applying $\bar{E}$ to $X$.



Fig. 6: The dependency of all the sequences and internal random variables for the proofs. 


• Case 2 (sub-parent): The parent run is a “sub-run” of a length-$l_X$ run, that is, an insertion of a different symbol in the middle of a parent run breaks it into two runs. In this case, $l_X > l_Y$. E.g.: $00000\downarrow^{1}000 \to 000001000$. Moreover, the next run in $\bar{Y}$ after this length-$l_Y$ $\bar{Y}$-run is also aligned to this $X$-run.

• Case 3 (multi-parent): There are $2t+1$ parent $X$-runs of this $\bar{Y}$-run. Of these parent $X$-runs, $t+1$ runs (the odd-numbered ones among the $2t+1$ $X$-runs) consist of the same symbol (0, in this example) as the corresponding $\bar{Y}$-run, and are of lengths $l_1, \dots, l_{t+1}$ respectively (say). Interleaved among these are the even-numbered $X$-runs, each consisting of just one symbol, which must be different from the symbol (0 in our example) that makes up the $\bar{Y}$-run. In this case, all the length-1 even-numbered $X$-runs get deleted and there is no edit in the other $t+1$ odd-numbered $X$-runs (of the same symbol as in this $\bar{Y}$-run), hence $l_Y = \sum_{j=1}^{t+1} l_j$ and $l_X = l_1 < l_Y$. E.g.: $00\cancel{1}00\cancel{1}0 \to 00000$.

Noting the parent run lengths in all the above cases and examining the run lengths of $\bar{Y}$ and $X$ in a left-to-right manner, the runs in $\bar{Y}$ can be “almost” aligned to their parent run(s) in $X$, except for the following two ambiguous local alignment events. We show later that with the help of a “small amount” of additional information $H(A_{X,\bar{Y}})$, $(X, \bar{Y})$ can be aligned.

• Ambiguous local alignment type-1 $\Gamma^1$ ($l_X = l_Y - 1$): Recall Case 3 ($l_X < l_Y$). When $t = 1$ and $l_X = l_1 = l_Y - 1$, $l_2 = 1$, the length of the $X$-run is the same as in Case 1.2 ($l_X = l_Y - 1$). Hence, when the length of the to-be-aligned $X$-run for a length-$l_Y$ $\bar{Y}$-run is found to be $l_Y - 1$, one cannot tell immediately whether it is Case 1.2 or Case 3.

• Ambiguous local alignment type-2 $\Gamma^2$ ($l_X = l_Y + 1$): Recall Case 2 ($l_X > l_Y$). When $l_X = l_Y + 1$ and the insertion of a different symbol occurs in front of the last symbol of the $X$-run, leading to a length-$l_Y$ $\bar{Y}$-run, the length of the $X$-run is the same as in Case 1.3 ($l_X = l_Y + 1$). Hence, when the length of the to-be-aligned $X$-run for a length-$l_Y$ $\bar{Y}$-run is found to be $l_Y + 1$, one cannot tell immediately whether it is Case 1.3 or Case 2.
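
The length conditions of Cases 1.1 to 3 and the two ambiguous events can be summarized in a few lines of code. This is a sketch based on the run lengths only (it ignores the symbols and all later runs), and the function names are illustrative.

```python
def candidate_cases(l_x, l_y):
    """Cases consistent with an X-run of length l_x facing a Y-run of length l_y."""
    cases = []
    if l_x == l_y:
        cases.append("1.1")   # single parent, no edit
    if l_x == l_y - 1:
        cases.append("1.2")   # single parent, one insertion
    if l_x == l_y + 1:
        cases.append("1.3")   # single parent, one deletion
    if l_x > l_y:
        cases.append("2")     # sub-parent: a run broken by an inserted symbol
    if l_x < l_y:
        cases.append("3")     # multi-parent: interleaved length-1 runs deleted
    return cases


def is_ambiguous(l_x, l_y):
    """True exactly for the type-1 (l_x = l_y - 1) and type-2 (l_x = l_y + 1) events."""
    return len(candidate_cases(l_x, l_y)) > 1
```

For instance, `candidate_cases(4, 5)` returns both Case 1.2 and Case 3, which is the type-1 ambiguity, while `candidate_cases(2, 5)` returns only Case 3.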
Fig. 7: Given a $\bar{Y}$-run (00000) with length $l_Y$, its parent run may be a single run, a sub-run, or several runs. Because there can be no more than one edit in an extended run in the typicalized edit process, we can explicitly find the forms of the edits in the different cases. If the parent run is a single run with length $l_X$, there may be no edit ($l_X = l_Y$), one insertion ($l_X = l_Y - 1$), or one deletion ($l_X = l_Y + 1$). If the parent run is a sub-run with length $l_X$, there must be one and only one insertion in the parent run, which breaks the parent run into two runs with lengths $l_Y$ and $l_X - l_Y$. In this case, $l_X > l_Y$. If the parent runs are several runs where the length of the first run is $l_X$, there must be $2t + 1$ parent runs ($t \ge 1$), where the odd-numbered runs are runs with the same symbol as the $\bar{Y}$-run, and the even-numbered runs are length-1 runs of symbols different from the $\bar{Y}$-run. In this case, $l_X < l_Y$.

X: 0001111223233
Y: 00101111232333
Alignment 1: $\cancel{0}\,0\,0\,1\downarrow^{0}1\,1\,1\,2$
Alignment 2: $0\,0\downarrow^{1}0\,1\,1\,1\,1\,2$

Fig. 8: Ambiguity resolved: 1) There is an ambiguous local alignment type-2 event ($l_X = l_Y + 1$) in aligning the first $X$-run and $\bar{Y}$-run. The first $\bar{Y}$-run (00) is of length 2, and the first $X$-run (000) to be aligned with the $\bar{Y}$-run is of length 3; both consist of the same symbol 0. The edit in the first $X$-run may be Case 1.3 (single deletion) or Case 2 (single insertion breaking the $X$-run). We therefore examine the next symbols in $X$ and $\bar{Y}$. 2) In fact, even if we examine the next one or two symbols in $X$ and $\bar{Y}$, the local ambiguity is not resolved. The symbol after the first $\bar{Y}$-run (00) is a 1, the same as the symbol after the first $X$-run (000), which means Case 1.3 (single deletion) is possible. The second symbol after the $\bar{Y}$-run (00) is a 0, the same as the symbol of the first $\bar{Y}$-run (00), which means Case 2 (single insertion breaking the $X$-run) is possible. 3) The ambiguity is resolved by aligning the second $X$-run to $\bar{Y}$. Alignment 1: This must mean that a 0 was inserted after the first 1 in the second $X$-run (1111), breaking it into two runs of 1's in $\bar{Y}$ separated by a 0 (respectively the third to the eighth symbols in $\bar{Y}$). This scenario is shown in the third line of the figure above. Since the second $X$-run had four 1's, the resulting $\bar{Y}$-run has three more 1's, with no more edits (since it is a typicalized $\bar{Y}$-run). However, there are four 1's in $\bar{Y}$ after the “inserted” 0. Hence, Alignment 1 is not possible. Alignment 2: The first three runs in $\bar{Y}$ (0010) are aligned to the first $X$-run. The next $X$-run and $\bar{Y}$-run to align both have four 1's, hence can be aligned correctly and unambiguously.


Note that ambiguous local alignments might be resolved when aligning further $X$-runs and $\bar{Y}$-runs. Not all ambiguous local alignments lead to different global alignments. The examples in Fig. 8 and Fig. 9 show, respectively, the scenario in which an ambiguous local alignment is resolved later, and the scenario in which an ambiguous local alignment leads to different global alignments.

We formally define the global alignment (we sometimes call it alignment for short) of a pair of PreESS and typicalized PosESS $(X, \bar{Y})$, and also the partial alignment of their subsequences.

Definition 4 (Global Alignment). Let the number of runs in a typicalized PosESS $\bar{Y}$ be denoted by $\rho_Y$. The typicalized PosESS $\bar{Y}$ can then be decomposed into $\bar{Y}$-runs as

$$\bar{Y} = \bar{Y}(1)\bar{Y}(2)\dots\bar{Y}(\rho_Y). \qquad (5)$$

We then divide $X$ into “segments that lead to the corresponding $\bar{Y}$-runs” as

$$X = X_{\bar{Y}}(1)X_{\bar{Y}}(2)\dots X_{\bar{Y}}(\rho_Y). \qquad (6)$$

Note that the $X_{\bar{Y}}(i)$'s are in general not runs of $X$. For any $\bar{Y}(i)$ that is created by insertions, set the corresponding $X_{\bar{Y}}(i)$ to be an empty run $\phi$ with length 0. For any $X$-run that is deleted and whose two neighbouring runs consist of different symbols, we force it to join the segment of its right neighbouring run. The alignment of $X$ and $\bar{Y}$ is defined by the vector of the lengths of the segments $X_{\bar{Y}}(i)$,

$$A_{X,\bar{Y}} = (|X_{\bar{Y}}(1)|, |X_{\bar{Y}}(2)|, \dots, |X_{\bar{Y}}(\rho_Y)|). \qquad (7)$$

X: 0001111223233
Y: 00101111232333
Alignment 1: $0\,0\downarrow^{1}0\,1\,1\,1\,1\,\cancel{2}\,2\,3\,2\,3\downarrow^{3}3$
Alignment 2: $0\,0\downarrow^{1}0\,1\,1\,1\,1\,2\downarrow^{3}2\,3\,\cancel{2}\,3\,3$

Fig. 9: Ambiguity unresolved: The edits in both Alignment 1 and Alignment 2 convert $X$ to $\bar{Y}$. The challenge therefore is to characterize the probability of such local ambiguity being globally unresolvable. This is the thrust of Lemma 5.

Definition 5 (Partial alignment). For the subsequence of a typicalized PosESS $\bar{Y}$ consisting of the first $i_Y$ runs $\bar{Y}(1)\bar{Y}(2)\dots\bar{Y}(i_Y)$, where $i_Y \le \rho_Y$, suppose the segments of $X$ that lead to these $\bar{Y}$-runs are $X_{\bar{Y}}(1)X_{\bar{Y}}(2)\dots X_{\bar{Y}}(i_Y)$. The partial alignment of $X$ and $\bar{Y}$ up to “depth” $i_Y$ is defined by the vector of the lengths of the segments $X_{\bar{Y}}(i)$,

$$A^{i_Y}_{X,\bar{Y}} = (|X_{\bar{Y}}(1)|, |X_{\bar{Y}}(2)|, \dots, |X_{\bar{Y}}(i_Y)|). \qquad (8)$$
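
To make the unresolved ambiguity of Fig. 9 concrete, the sketch below applies two different edit patterns to the same $X$ and checks that both produce the same sequence. The index-based encoding of the edits (a set of deleted indices and a mapping from gap positions to inserted strings) and the exact positions used for the two alignments are assumptions reconstructed for illustration; the figure itself is only partially legible in this text.

```python
def apply_edits(x, deletions, insertions):
    """Apply deletions (set of indices of x) and insertions ({gap: string}) to x.

    Gap p denotes the position in front of x[p]; p == len(x) is after the end.
    """
    out = []
    for p in range(len(x) + 1):
        out.append(insertions.get(p, ""))       # symbols inserted in front of x[p]
        if p < len(x) and p not in deletions:
            out.append(x[p])                    # symbol kept (no-operation)
    return "".join(out)


X = "0001111223233"
Y = "00101111232333"

# Assumed reading of Alignment 1: insert 1 into the first run, delete one 2
# from the run 22, and lengthen the final run 33 by inserting a 3.
alignment_1 = ({7}, {2: "1", 12: "3"})

# Assumed reading of Alignment 2: insert 1 into the first run, break the run 22
# with an inserted 3, and delete the later single-symbol run 2.
alignment_2 = ({10}, {2: "1", 8: "3"})
```

Both `apply_edits(X, *alignment_1)` and `apply_edits(X, *alignment_2)` evaluate to `Y`, so the pair $(X, \bar{Y})$ alone does not reveal which edit pattern was used.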


Recall that “nature’s secret” is the uncertainty of the edit pattern given the PreESS and PosESS. We now bound the “nature’s secret” of the typicalized edit pattern, $H(\bar{E}|X, \bar{Y})$, from above by $H(\bar{E}, A_{X,\bar{Y}}|X, \bar{Y})$. We further bound the latter quantity from above by the sum of two terms: the uncertainty $H(A_{X,\bar{Y}})$ of the global alignment, and the uncertainty $H(\bar{E}|X, \bar{Y}, A_{X,\bar{Y}})$ of the typicalized edit pattern given the global alignment.

Lemma 5. $\lim_{n\to\infty}\frac{1}{n}H(A_{X,\bar{Y}}) \le O(\max(\epsilon,\delta)^2)$.

Proof: The intuition for why the uncertainty $H(A_{X,\bar{Y}})$ of the global alignment is “small” is as follows. In any ambiguous local alignment event $\Gamma = \Gamma^1 \cup \Gamma^2$, one of the two edit patterns has an insertion and the other has a deletion. Hence “locally” the positions of the output $\bar{Y}$ obtained by applying these two edit patterns to $X$ differ by a shift of two positions. Suppose the matching procedure described in Fig. 10 keeps aligning $X$ w.r.t. $\bar{Y}$ via both edit patterns and the ambiguity is still not resolved. That means we can find at least two distinct typicalized edit sequences that convert two “similar” sections of $X$, which differ by a shift of two positions, to the same section of $\bar{Y}$. This means that some symbols (it turns out at least one out of every two neighbouring symbols) in one section of $X$ determine the values of other symbols within a short block. This is because of the property of typicalized edits that “not too many” insertions or deletions (no contiguous insertions/deletions) can happen in a short block. Hence, averaging over $X$, the probability that we need extra information to resolve ambiguous local alignments is “small”.

In the following, we bound $H(A_{X,\bar{Y}})$ from above carefully. We first convert the uncertainty $H(A_{X,\bar{Y}})$, averaged over the PreESS $X$ and typicalized PosESS $\bar{Y}$, to the number of “splits” (unresolved ambiguous local alignments), averaged over the PreESS $X$ and edit pattern $E$, as shown in the chain of equations starting at Equation (9). Denote the number of $x$-runs by $\rho_x$. For $i = 1, 2, \dots, \rho_x$, define the event $E_i(x, e)$ from the matching algorithm as follows: after typicalizing $e$ to $\bar{e}$ and applying $\bar{e}$ to $x$, the $i$th $x$-run encounters an ambiguous local alignment, and for the subsequence starting from the first symbol after the $i$th run and ending at the symbol before the next edit in $\bar{e}$ (we call the length of this block in $x$ the “gap”), the ambiguous edit pattern at the $i$th run can obtain the same $\bar{y}$ through some typical edits. If $E_i(x, e)$ does occur, it may cause a split on the alignment path to which $\bar{e}$ belongs, in which case one bit is needed to distinguish between the two ambiguous edit patterns. Hence, the total number of bits needed to distinguish the path/alignment associated with $\bar{e}$ from other paths splitting from it is bounded from above by $\sum_{i=1}^{\rho_x} 1_{E_i(x,e)}$. For $i = 1, 2, \dots, \rho_x$, denote the length of the $i$th $x$-run by $l_i$. Conditioned on an ambiguous local alignment $\Gamma^{(i)}$ occurring at the $i$th $x$-run, and on the “gap” $g$ from the symbol after the $i$th $x$-run until the symbol before the next edit, the probability $\Pr(E_i(x, e)\,|\,\Gamma^{(i)}, \text{longest gap } g)$ depends only on $x$ and $g$. We denote this probability averaged over $X$ by $\Pr_g = \sum_{x\in\mathcal{X}} \Pr(x)\Pr(E_i(x, e)\,|\,\Gamma^{(i)}, \text{longest gap } g)$ and bound $\Pr_g$ later through some case analysis.

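$$H(A_{X,\bar{Y}}) \overset{(a)}{=} \sum_{x\in\mathcal{X}} \Pr(x) \sum_{\bar{y}\in\bar{\mathcal{Y}}(x)} \Pr(\bar{y}|x)\, H(A_{x,\bar{y}}) \qquad (9)$$

$$\overset{(b)}{=} \sum_{x\in\mathcal{X}} \Pr(x) \sum_{\bar{y}\in\bar{\mathcal{Y}}(x)} \Big(\sum_{e\in\mathcal{E}:\,(x,e)\to\bar{e}\to\bar{y}} \Pr(e)\Big) H(A_{x,\bar{y}}) \qquad (10)$$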
Fig. 10: The flowchart of the align module to align $X$ and $\bar{Y}$: The module takes $X$ and $\bar{Y}$ as inputs, and outputs all the possible alignments $A_{X,\bar{Y}}$ as a binary tree of depth $\rho_Y$. Any path of the output tree of length $\rho_Y$ is a global alignment of $(X, \bar{Y})$ as defined in Definition 4; any partial path starting from the root of the tree with length $l_{pa} < \rho_Y$ is a partial alignment up to depth $l_{pa}$ as defined in Definition 5. In the process of aligning $(X, \bar{Y})$, when an ambiguous local alignment occurs, the process keeps both edit patterns and continues aligning further runs with both alignments. This leads to new loops of the algorithm and possibly new branches (splits) on the tree $A_{X,\bar{Y}}$ if the ambiguity is not resolved by aligning further runs.


$$\le \sum_{x\in\mathcal{X}} \Pr(x) \sum \cdots$$

[Equations (11)–(22): the remainder of this displayed chain did not survive in the text; steps (c)–(f) are described in the following paragraph.]

In equality (a), the set $\bar{\mathcal{Y}}(x)$ is obtained by typicalizing the set $\mathcal{Y}(x)$ of all the sequences $y(x)$ that result from applying any edit pattern $E$ to $x$. In equality (b), we replace $\Pr(\bar{y}|x)$ with the sum of the probabilities of all the edit patterns that, after being typicalized with $x$ and applied to $x$, yield $\bar{y}$. Inequality (c) follows by bounding the entropy of the tree $A_{x,\bar{y}}$ from above by the average of the number of splits $N_{\text{split}}(\cdot)$ over all the paths. Note that a path of the tree $A_{x,\bar{y}}$ is a certain global alignment of $(x, \bar{y})$, consisting of many typicalized edit patterns $\bar{e}$, the probability of each of which is the sum of the probabilities of all the $e$ resulting in $\bar{e}$ after typicalizing. Equality (d) follows by directly cancelling $\sum_{e\in\mathcal{E}:\,(x,e)\to\bar{e}\to\bar{y}} \Pr(e)$. Equalities (e) and (f) follow because fixing $x$ and $e$ fixes a path on the tree $A_{x,\bar{y}}$. Moreover, for all the $e$ that fix the same path, the values $N_{\text{split}}(P(x, e))$ are equal.

In the following, we calculate $\Pr_g$: conditioned on the occurrence of an ambiguous local alignment, the probability that the ambiguity is not resolved by continuing the matching process until the gap $g$. We break the calculation into four cases based on the type of the ambiguous local alignment and on which edit actually happened. $\Pr_g$ is the probability, averaged over $X$ and $E$, that the path on the tree $A_{X,\bar{Y}}$ splits into two branches at a node.
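
The case analysis in this proof relies on two counting probabilities over a uniform alphabet: $\frac{2|\mathcal{A}|-1}{|\mathcal{A}|^2}$, the probability that at least one of two symbols equals a given symbol, and $\frac{4|\mathcal{A}|^2-6|\mathcal{A}|+3}{|\mathcal{A}|^3}$, the probability that at least one of two symbols equals one of two other symbols. The sketch below verifies both by exhaustive enumeration for small alphabets; the function names are illustrative and the symbols are assumed i.i.d. uniform, as in the RPES model.

```python
from fractions import Fraction
from itertools import product


def prob_one_of_two_matches_fixed(q):
    """Pr(at least one of two uniform symbols equals a fixed symbol), alphabet size q."""
    hits = sum(1 for c, d in product(range(q), repeat=2) if c == 0 or d == 0)
    return Fraction(hits, q ** 2)


def prob_one_of_two_matches_pair(q):
    """Pr(at least one of two uniform symbols equals one of two other uniform symbols)."""
    hits = sum(1 for a, b, c, d in product(range(q), repeat=4)
               if c in (a, b) or d in (a, b))
    return Fraction(hits, q ** 4)
```

Exact rational arithmetic is used so that the comparison with the closed forms involves no rounding.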

• Ambiguous local alignment $\Gamma^1$ ($l_X = l_Y - 1$): W.l.o.g., assume the symbol in the run is 0 and the subsequence of $X$ starting from the run is $0x_1x_2x_3\dots$. The corresponding $\bar{Y}$-run to be aligned is 00. There are two possibilities: 1) Case $\Gamma^1(\iota)$: this possibility corresponds to an edit pattern resulting in $0\downarrow^{0}x_1\dots \to 00x_1\dots$ with an insertion of 0. 2) Case $\Gamma^1(\Delta)$: the other possibility corresponds to the edit pattern in which $x_1$ is deleted and 0 combines with $x_2$, resulting in 00 in the corresponding locations in $\bar{Y}$: $0\cancel{x_1}0x_3\dots \to 00x_3\dots$. In this case $x_2$ must equal 0. In other words, if $x_2$ is not 0, this edit pattern is impossible and the ambiguity is resolved. Averaging over $p(X)$, the event $x_2 = 0$ has probability $\frac{1}{|\mathcal{A}|}$. Moreover, this edit pattern results in either $0\cancel{x_1}0x_3\dots \to 00x_3\dots$ (if $x_3$ is not deleted), or $0\cancel{x_1}0\cancel{x_3}x_4\dots \to 00x_4\dots$ (if $x_3$ is also deleted).

Hence, the local ambiguous event happens only if either $x_3$ or $x_4$ is the same as $x_1$, which happens with probability

$$1 - \left(\frac{|\mathcal{A}|-1}{|\mathcal{A}|}\right)^2 = \frac{2|\mathcal{A}|-1}{|\mathcal{A}|^2}.$$

- Case $\Gamma^1(\iota)$: The actual edit $E$ is a single insertion $\iota$, and until the gap $g$ there is no other edit:

$$0\downarrow^{0}x_1 0 x_3x_4x_5\dots x_g\dots \to 00x_1 0x_3x_4x_5\dots x_g\dots. \qquad (23)$$

In this case, the smallest $g$ is 1; we denote $g = 2t - 1$ or $2t$, where $t = 1, 2, \dots$. The ambiguous edit is a deletion of $x_1$ and should also result in the same $\bar{Y}$ through some typical edits:

$$0\cancel{x_1}0x_3x_4x_5\dots x_g\dots \to 00x_3x_4x_5\dots x_g\dots \xrightarrow{\text{some typical edits}} 00x_1 0x_3x_4x_5\dots x_g\dots. \qquad (24)$$

The symbol $x_1$ can equal any symbol from the alphabet but 0; w.l.o.g. assume $x_1 = 1$. From the above, there should be some typical edits such that after applying these edits to the sequence $x_3x_4x_5\dots x_g\dots$, the first $g$ symbols of the resulting sequence are $1\,0\,x_3x_4x_5\dots x_g$, a shift rightwards of two positions. In the following, we show that, averaging over $\Pr(X)$, the probability that one can find some typical edits that shift a sequence rightwards by two positions and match up to length $g$ decays with $g$. (These $X$'s are the ones that have splits in the tree $A_{X,\bar{Y}}$ along the paths with the $E$ we are considering now.)

We first argue that the shift rightwards of two positions cannot be accomplished before reaching the gap $g$. Firstly, typical edits only shift the sequence by one position at a time, because in a typicalized edit pattern no contiguous edits can happen. Before the sequence is shifted rightwards by two positions, it must have been shifted rightwards by one position by an insertion. After the insertion makes the shift by one position, all the symbols after the insertion are the same and no other edits can happen (the symbols form a run). For example, in $x_3x_4x_5\dots x_g\dots \to 1\,0\,x_3x_4x_5\dots x_g$, the insertion of 0 shifts the sequence rightwards by one position. Because $x_3$ cannot be deleted, $x_3$ has to equal 1. Hence we have $1\downarrow^{0}x_4x_5\dots x_g\dots \to 1\,0\,1\,x_4x_5\dots x_g$. Also, $x_4$ has to equal 1, because for typicalized edit patterns $x_4$ cannot be deleted, nor can an insertion happen in front of $x_4$. By continuing the deduction, the symbols $\{x_4, x_5, \dots, x_g\}$ should all equal $x_3 = 1$ and there can be no other edits among them because they form a run.

We prove an upper bound on $\Pr_g$ by induction. Recall that either $x_3$ or $x_4$ has to equal $x_1 = 1$. Hence for $g = 1$, $\Pr_1 = 1 - \left(\frac{|\mathcal{A}|-1}{|\mathcal{A}|}\right)^2 = \frac{2|\mathcal{A}|-1}{|\mathcal{A}|^2}$. Assume that for an odd number $g = 2t - 1$, where $t = 1, 2, \dots$, the sequence $x_3x_4x_5\dots x_g\dots$ can be converted to its shift rightwards by two positions up to the gap $g$, namely $1\,0\,x_3x_4x_5\dots x_g$. We look for the condition that should hold for the shifted sequence to be able to match up to the gap $g + 2 = 2t + 1$. Because we argued in the last paragraph that the position (index) of the sequence will not shift rightwards by two before the gap, the segment of the sequence that converts to $1\,0\,x_3x_4x_5\dots x_g$ ends at index at least $g + 1$. If the index is $g + 1$ (that is, $x_3x_4x_5\dots x_{g+1}$ converts to $1\,0\,x_3x_4x_5\dots x_g$), then by the last paragraph, to match two more symbols we need $x_{g+3} = x_{g+2} = x_{g+1}$, which has probability $\frac{1}{|\mathcal{A}|^2}$. If the index is greater than $g + 1$, for example $g + 2$ (that is, $x_3x_4x_5\dots x_{g+2}$ converts to $1\,0\,x_3x_4x_5\dots x_g$), then among $x_{g+3}, x_{g+4}$, at least one of them should be the same symbol as $x_{g+1}$ or $x_{g+2}$. By conditioning on whether $x_{g+1}$ and $x_{g+2}$ are equal, the probability is

$$\frac{1}{|\mathcal{A}|}\left(1 - \left(\frac{|\mathcal{A}|-1}{|\mathcal{A}|}\right)^2\right) + \frac{|\mathcal{A}|-1}{|\mathcal{A}|}\left(1 - \left(\frac{|\mathcal{A}|-2}{|\mathcal{A}|}\right)^2\right) = \frac{4|\mathcal{A}|^2 - 6|\mathcal{A}| + 3}{|\mathcal{A}|^3} < 1.$$

Hence we have $\Pr_{2t+1} \le \frac{4|\mathcal{A}|^2 - 6|\mathcal{A}| + 3}{|\mathcal{A}|^3} \cdot \Pr_{2t-1}$. For even numbers $g = 2t$, where $t = 1, 2, \dots$, we can bound the probability $\Pr_g = \Pr_{2t}$ by $\Pr_{2t-1}$. Hence, we have $\Pr_g \le \frac{2|\mathcal{A}|-1}{|\mathcal{A}|^2}\left(\frac{4|\mathcal{A}|^2 - 6|\mathcal{A}| + 3}{|\mathcal{A}|^3}\right)^{t-1}$ for $g = 2t - 1$ or $2t$, where $t = 1, 2, \dots$.

- Case $\Gamma^1(\Delta)$: The actual edit $E$ is the deletion $\Delta$ of $x_1$, and until the gap $g$ there is no other edit:

$$0\cancel{x_1}0x_3x_4x_5\dots x_g\dots \to 00x_3x_4x_5\dots x_g\dots. \qquad (25)$$

In this case, $x_3$ can be deleted and the smallest $g$ is 2. We denote $g = 2t$ or $2t + 1$, where $t = 1, 2, \dots$. The ambiguous edit is a single insertion of 0 in the run of 0's and should also result in the same $\bar{Y}$ through some typical edits:

$$0\downarrow^{0}x_1 0x_3x_4x_5\dots x_g\dots \to 00x_1 0x_3x_4x_5\dots x_g\dots \xrightarrow{\text{some typical edits}} 00x_3x_4x_5\dots x_g\dots. \qquad (26)$$

W.l.o.g., assume $x_1 = 1$. From the above, there should be some typical edits such that after applying these edits to the sequence $1\,0\,x_3x_4x_5\dots x_g\dots$, the first $g - 2$ symbols of the resulting sequence are $x_3x_4x_5\dots x_g$, a shift leftwards of two positions.

With similar arguments as in Case $\Gamma^1(\iota)$, the position/index of the sequence will not shift leftwards by two positions to match the index of $\bar{Y}$ before the actual edit pattern has the next edit (before the gap). For the initial condition, $\Pr_2 = 1$ and $\Pr_3 = \frac{1}{|\mathcal{A}|}$. By induction, for even numbers $g = 2t$, where $t = 1, 2, \dots$, $\Pr_{g+2} = \Pr_{2t+2} \le \frac{4|\mathcal{A}|^2 - 6|\mathcal{A}| + 3}{|\mathcal{A}|^3}\Pr_{2t}$. For odd numbers $g = 2t + 1$, where $t = 1, 2, \dots$, we can bound the probability $\Pr_g = \Pr_{2t+1}$ by $\Pr_{2t}$. Hence we have $\Pr_g \le \left(\frac{4|\mathcal{A}|^2 - 6|\mathcal{A}| + 3}{|\mathcal{A}|^3}\right)^{t-1}$ for $g = 2t$ or $2t + 1$, where $t = 1, 2, \dots$.

• Ambiguous local alignment $\Gamma^2$ ($l_X = l_Y + 1$): W.l.o.g., assume the symbol in the run is 0 and the subsequence of $X$ starting from the run is $00x_1x_2x_3\dots$. The corresponding $\bar{Y}$-run to be aligned is 0. There are two possibilities: 1) Case $\Gamma^2(\Delta)$: this corresponds to an edit pattern resulting in $0\cancel{0}x_1\dots \to 0x_1\dots$ with a deletion of 0 in the run. 2) Case $\Gamma^2(\iota)$: the other possibility corresponds to the edit pattern with an insertion of a symbol $t$ other than 0 in front of the last 0 in the run, breaking the $X$-run into two runs of 0 with lengths $l_X - 1$ and 1: $0\downarrow^{t}0x_1\dots \to 0t0x_1\dots$.
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- Case r 2 (A): The actual edit E is a single deletion A, and until the gap g there is no other edit: 

O0X\X2X3 . . . Xg ■ ■ ■ —> 0*1*2*3 . . . Xg .... (27) 

In this case, the smallest g is 1. Denote g = 27 — 1 or 27, where 7 = 1,2,.... The ambiguous edit is an insertion of 
*i in front of the last 0 and should also results in the same Y through some typical edits: 

1 some typical edits 

Cr 1 0 *i* 2*3 ... x g ■ ■ ■ —> OX 1 OX 1 X 2 X 3 ... x g ... - > OX 1 X 2 X 3 ... x g .... (28) 

W.l.o.g., assume x\ = 1. From the above, there should be some typical edits such that after applying these edits to 
the sequence 01 x 2 * 3*4 ■ ■-x g ..., the first g — 1 symbols of the resulting sequence should be * 2 * 3*4 ... x g - a shift 
leftwards of two positions. 

This is similar as Case T 1 (A) - shift forwards of two positions. (The only difference here is the length of sequence 
needed to match after the shift is g — 1 istead of g — 2 in this case.) In this case we have Pr s < 
for g = 27 — 1 or 27 where 7 = 1,2,.... 

- Case $\Gamma_2(\iota)$: The actual edit $E$ is an insertion of a symbol other than 0 in front of the last 0, and until the gap $g$ there is no other edit:

$$0\,0\,x_1 x_2 x_3 \ldots x_g \ldots \;\longrightarrow\; 0\,\iota\,0\,x_1 x_2 x_3 \ldots x_g \ldots \tag{29}$$

In this case, the smallest $g$ is 1. Denote $g = 2t - 1$ or $2t$, where $t = 1, 2, \ldots$. The ambiguous edit is a single deletion of 0, which should also result in the same $Y$ through some typical edits:

$$0\,0\,x_1 x_2 x_3 \ldots x_g \ldots \;\longrightarrow\; 0\,x_1 x_2 x_3 \ldots x_g \ldots \;\xrightarrow{\text{some typical edits}}\; 0\,\iota\,0\,x_1 x_2 x_3 \ldots x_g \ldots \tag{30}$$

The ambiguity only exists if the inserted symbol $\iota$ equals $x_1$. W.l.o.g., assume $\iota = x_1 = 1$. From the above, there should be some typical edits such that, after applying these edits to the sequence $x_2 x_3 \ldots x_g \ldots$, the first $g+1$ symbols of the resulting sequence are $0\,1\,x_2 x_3 \ldots x_g$, i.e., a shift rightwards of two positions.

This is similar as Case I’ 1 (7) - shift rightwards of two positions. (The only difference here is the length of sequence 

needed to match after the shift is g +1 istead of g in this case.) In this case, we have Pr ff < 4 ^7 4 • |3q^^ +3 ) 

for g = 27 — 1 or 27 where 7 = 1,2,.... 

From the above case analysis, for all four cases, we have Pr g < for g = 27—1 or g = 27 where 7 = 1,2,.... 

Hence $H(A_{X,\hat{Y}}) \le \max(\epsilon,\delta)^2 \cdot 2n \cdot \sum_{g=1}^{\infty} \Pr_g = O(\max(\epsilon,\delta)^2)\,n$.

□ 

Lemma [6] below characterizes the “nature’s secret” of the typicalized edit process as defined in Definition Q] 


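Lemma 6. $\lim_{n\to\infty} \frac{1}{n} H(\hat{E} \mid X, \hat{Y}) \le C_{|\mathcal{A}|}(\delta + \epsilon) + O(\max(\epsilon,\delta)^2)$, where $C_{|\mathcal{A}|} = \left(1 - \frac{1}{|\mathcal{A}|}\right)^2 \sum_{l=1}^{\infty} \left(\frac{1}{|\mathcal{A}|}\right)^{l-1} l \log l$ is a constant that depends only on the alphabet size $|\mathcal{A}|$.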


Proof: Knowing the global alignment of $(x, \hat{y})$, the uncertainty in the typicalized edit pattern lies only in the locations of the single-deletions and the single-insertions of the same symbol (as in the run) within the $x$-runs. By the definition of the typicalized edit pattern, an $x$-run undergoes at most one edit. Hence, we define the following notation describing the edits from the perspective of the $x$-runs, which will be useful in calculating $H(\hat{E} \mid X, \hat{Y}, A_{X,\hat{Y}})$.

For any PreESS $x$, recall that we denote the number of runs in $x$ by $\rho_x$, and the run lengths by $\{l_1, l_2, \ldots, l_{\rho_x}\}$. In the following, we derive the probabilities of insertions and deletions in the typicalized edit process from both the symbol-perspective and the run-perspective.

For the symbol-perspective typicalized insertion/deletion probabilities, for any $j = 1, 2, \ldots$, denote by $\delta_j$ the probability that any specific symbol in the $j$th $x$-run is deleted, $\delta_j = \delta(1 - \epsilon - \delta)^{l_j+1} \in (\delta - (l_j + 1)(\delta^2 + \epsilon\delta),\, \delta)$. Similarly, denote by $\epsilon_j$ the probability that there is an insertion between two specific symbols in the extended run of the $j$th $x$-run, $\epsilon_j = \epsilon(1 - \epsilon - \delta)^{l_j+2} \in (\epsilon - (l_j + 2)(\epsilon^2 + \epsilon\delta),\, \epsilon)$. Actually, we only need $\delta_j < \delta$ and $\epsilon_j < \epsilon$ for upper bounding the "nature's secret". The specific distribution of the typicalized edit process is of interest for our future research on the channel capacity of InDel channels.

Note that in the typicalized edit process, an $x$-run either undergoes a single-deletion or a single-insertion. Hence, we derive the insertion/deletion probabilities from the run-perspective. For any global alignment $a \in \{1, 2, \ldots, \beta_{x,\hat{y}}\}$, denote by $D_{(a)} \in \{0,1\}^{\rho_x}$ the run-perspective single-deletion pattern, where $D_{(a),j} = 1$ indicates that there is one deletion in the $j$th $x$-run in global alignment $a$. Similarly, denote by $I_{\mathrm{same}(a)} \in \{0,1\}^{\rho_x}$ the run-perspective single-same-symbol-insertion pattern, where $I_{\mathrm{same}(a),j} = 1$ indicates that there is one insertion of the same symbol (an insertion that lengthens the run) in the $j$th $x$-run in global alignment $a$. Dropping the subscript $(a)$ in $D_{(a),j}$ and $I_{\mathrm{same}(a),j}$, the variables $D_j$ and $I_{\mathrm{same},j}$ are the indicator random variables of a single-deletion and a single-same-symbol-insertion in the $j$th $x$-run, respectively, averaged over all global alignments.

For a pair $(x, \hat{y})$, denote by $(x, e) \to \hat{y}$ the event that applying a typicalized edit pattern $e$ to $x$ leads to $\hat{y}$, so that $p(\hat{y} \mid x) = \sum_{e \,:\, (x,e) \to \hat{y}} p(e)$. Moreover, all the typicalized edit patterns $e$ with $(x, e) \to \hat{y}$ are classified into $\beta_{x,\hat{y}}$ groups $\{E_{(a)}\}$ based on the global alignments, where $E_{(a)}$ denotes the set of typicalized edit patterns $e$ that belong to global alignment $a$ of $(x, \hat{y})$. Hence, for all $a \in \{1, 2, \ldots, \beta_{x,\hat{y}}\}$,

$$P(A_{x,\hat{y}} = a) = \frac{\sum_{e \in E_{(a)} :\, (x,e) \to \hat{y}} p(e)}{\sum_{e \,:\, (x,e) \to \hat{y}} p(e)} = \frac{\sum_{e \in E_{(a)} :\, (x,e) \to \hat{y}} p(e)}{p(\hat{y} \mid x)}.$$

Hence,

$$\sum_{\hat{y}} p(\hat{y} \mid x) \sum_{a=1}^{\beta_{x,\hat{y}}} P(A_{x,\hat{y}} = a)\, P(D_{(a),j} = 1) = \sum_{\hat{y}} \sum_{a=1}^{\beta_{x,\hat{y}}} \sum_{e \in E_{(a)} :\, (x,e) \to \hat{y}} p(e)\, P(D_{(a),j} = 1) = \sum_{e} p(e)\, P(D_j = 1)$$

is the probability that there is one deletion in the $j$th $x$-run, averaged over all the typicalized edit patterns, and equals $l_j \delta_j$. Similarly, $\sum_{\hat{y}} p(\hat{y} \mid x) \sum_{a=1}^{\beta_{x,\hat{y}}} P(A_{x,\hat{y}} = a)\, P(I_{\mathrm{same}(a),j} = 1) = \sum_{e} p(e)\, P(I_{\mathrm{same},j} = 1)$ is the probability that there is an insertion of the same symbol in the $j$th $x$-run, averaged over all the typicalized edit patterns, and equals $\frac{1}{|\mathcal{A}|}(l_j + 1)\epsilon_j$.


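\begin{align}
H(\hat{E} \mid X, \hat{Y}, A_{X,\hat{Y}}) &= \sum_{x,\hat{y},a} p(x,\hat{y},a)\, H(\hat{E} \mid x,\hat{y},a) \tag{31}\\
&= \sum_{x,\hat{y},a} p(x,\hat{y})\, p(a \mid x,\hat{y})\, H(\hat{E} \mid x,\hat{y},a) \tag{32}\\
&= \sum_{x,\hat{y}} p(x,\hat{y}) \sum_{a=1}^{\beta_{x,\hat{y}}} P(A_{x,\hat{y}} = a)\, H(\hat{E} \mid x,\hat{y},a) \tag{33}\\
&\overset{(a)}{=} \sum_{x,\hat{y}} p(x,\hat{y}) \sum_{a=1}^{\beta_{x,\hat{y}}} P(A_{x,\hat{y}} = a) \sum_{j=1}^{\rho_x} \left( D_{(a),j} \log l_j + I_{\mathrm{same}(a),j} \log (l_j + 1) \right) \tag{34}\\
&= \sum_{x} p(x) \sum_{\hat{y}} p(\hat{y} \mid x) \sum_{a=1}^{\beta_{x,\hat{y}}} P(A_{x,\hat{y}} = a) \sum_{j=1}^{\rho_x} \left( D_{(a),j} \log l_j + I_{\mathrm{same}(a),j} \log (l_j + 1) \right) \tag{35}\\
&= \sum_{x} p(x) \sum_{j=1}^{\rho_x} \sum_{\hat{y}} p(\hat{y} \mid x) \sum_{a=1}^{\beta_{x,\hat{y}}} P(A_{x,\hat{y}} = a) \left( P(D_{(a),j} = 1) \log l_j + P(I_{\mathrm{same}(a),j} = 1) \log (l_j + 1) \right) \tag{36}\\
&\overset{(b)}{=} \sum_{x} p(x) \sum_{j=1}^{\rho_x} \left( \delta_j\, l_j \log l_j + \frac{1}{|\mathcal{A}|}\, \epsilon_j\, (l_j + 1) \log (l_j + 1) \right) \tag{37}\\
&\overset{(c)}{\le} \sum_{x} p(x) \sum_{j=1}^{\rho_x} \left( \delta\, l_j \log l_j + \frac{1}{|\mathcal{A}|}\, \epsilon\, (l_j + 1) \log (l_j + 1) \right) \tag{38}\\
&\overset{(d)}{=} n\delta \left(1 - \tfrac{1}{|\mathcal{A}|}\right) \sum_{l=1}^{\infty} \left(\tfrac{1}{|\mathcal{A}|}\right)^{l-1} \left(1 - \tfrac{1}{|\mathcal{A}|}\right) l \log l + \frac{n\epsilon}{|\mathcal{A}|} \left(1 - \tfrac{1}{|\mathcal{A}|}\right) \sum_{l=1}^{\infty} \left(\tfrac{1}{|\mathcal{A}|}\right)^{l-1} \left(1 - \tfrac{1}{|\mathcal{A}|}\right) (l + 1) \log (l + 1) \tag{39}\\
&\overset{(e)}{=} n(\delta + \epsilon) \left(1 - \tfrac{1}{|\mathcal{A}|}\right)^2 \sum_{l=1}^{\infty} \left(\tfrac{1}{|\mathcal{A}|}\right)^{l-1} l \log l \tag{40}
\end{align}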


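where step (a) holds because, when the global alignment of $(x, \hat{y})$ is known, the uncertainty lies only in the edit positions in those $x$-runs undergoing a single-deletion or a single-same-symbol-insertion. Step (b) comes from the analysis in the preceding paragraph. Step (c) holds because $\delta_j \in (\delta - (l_j + 1)(\delta^2 + \epsilon\delta),\, \delta)$ and $\epsilon_j \in (\epsilon - (l_j + 2)(\epsilon^2 + \epsilon\delta),\, \epsilon)$. (In fact, it is straightforward that $\delta_j < \delta$ and $\epsilon_j < \epsilon$, because the typicalized edit pattern is obtained from the original edit pattern by eliminating some edits.) Step (d) holds because $\sum_{x} p(x) \sum_{j=1}^{\rho_x} l_j \log l_j = \sum_{l=1}^{\infty} \frac{n}{E[L]}\, p(l)\, l \log l$, where $p(l) = \left(1 - \frac{1}{|\mathcal{A}|}\right)\left(\frac{1}{|\mathcal{A}|}\right)^{l-1}$ is the run length distribution of $X$ and $E[L] = 1/\left(1 - \frac{1}{|\mathcal{A}|}\right)$ is its expectation; the same argument applies to $\sum_{x} p(x) \sum_{j=1}^{\rho_x} (l_j + 1) \log (l_j + 1)$. Step (e) comes from changing the index $l + 1$ to $l$ and some calculation.

Finally,

$$\lim_{n\to\infty} \frac{1}{n} H(\hat{E} \mid X, \hat{Y}) \le \lim_{n\to\infty} \frac{1}{n} H(\hat{E}, A_{X,\hat{Y}} \mid X, \hat{Y}) = \lim_{n\to\infty} \frac{1}{n} \left[ H(A_{X,\hat{Y}} \mid X, \hat{Y}) + H(\hat{E} \mid X, \hat{Y}, A_{X,\hat{Y}}) \right]$$

$$\le \lim_{n\to\infty} \frac{1}{n} H(A_{X,\hat{Y}}) + \lim_{n\to\infty} \frac{1}{n} H(\hat{E} \mid X, \hat{Y}, A_{X,\hat{Y}}) \le (\delta + \epsilon) \left(1 - \tfrac{1}{|\mathcal{A}|}\right)^2 \sum_{l=1}^{\infty} \left(\tfrac{1}{|\mathcal{A}|}\right)^{l-1} l \log l + O(\max(\epsilon,\delta)^2).$$

□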


□ 

In the following Lemma 7 we show that the nature's secret for the original edit process is "close" to the nature's secret of the typicalized edit process. We first reprise a useful fact from [16].

Fact 2. [16, Fact V.25] Suppose $U$, $\hat{U}$, and $V$ are random variables with the property that $U$ is a deterministic function of $\hat{U}$ and $V$, and also $\hat{U}$ is a deterministic function of $U$ and $V$. (Denote this property by $U \overset{V}{\longleftrightarrow} \hat{U}$.) Then

$$|H(U) - H(\hat{U})| \le H(V). \tag{41}$$
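As a sanity check of the inequality in Fact 2, the following minimal sketch evaluates both sides on a toy example. The distributions (two independent Bernoulli variables) and all function names are assumptions chosen purely for illustration; they do not come from the paper. Here $U = A$, $V = B$ and $\hat{U} = A \oplus B$, so each of $U$ and $\hat{U}$ is a deterministic function of the other together with $V$.

```python
from itertools import product
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, index):
    """Marginal distribution of one coordinate of a joint distribution."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[index]] = out.get(outcome[index], 0.0) + p
    return out


# Hypothetical toy model: A ~ Bernoulli(0.1) and B ~ Bernoulli(0.3), independent.
# U = A, V = B, U_hat = A XOR B, hence U = U_hat XOR V and U_hat = U XOR V.
joint = {}
for a, b in product((0, 1), repeat=2):
    p = (0.1 if a else 0.9) * (0.3 if b else 0.7)
    joint[(a, a ^ b, b)] = p  # coordinates are (U, U_hat, V)

h_u = entropy(marginal(joint, 0))
h_u_hat = entropy(marginal(joint, 1))
h_v = entropy(marginal(joint, 2))
print(f"|H(U) - H(U_hat)| = {abs(h_u - h_u_hat):.4f}, H(V) = {h_v:.4f}")
```

The gap between the two entropies is bounded by the entropy of the "side information" $V$, which is exactly how the fact is used below with $V = E^C$.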

Lemma 7. $\lim_{n\to\infty} \frac{1}{n} \left| H(E \mid X, Y) - H(\hat{E} \mid X, \hat{Y}) \right| \le 56 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2)$ for any $\tau > 0$.

Proof: We use Fact 2 to bound $|H(E, X, Y) - H(\hat{E}, X, \hat{Y})|$ by $H(E^C)$. To do so, we map $(E, X, Y)$ to $U$, $(\hat{E}, X, \hat{Y})$ to $\hat{U}$, and $E^C$ to $V$ in Fact 2, and show below that the conditions required in Fact 2 are satisfied. Similarly, by mapping $(X, Y)$ to $U$, $(X, \hat{Y})$ to $\hat{U}$, and $(E^C, A_{X,\hat{Y}})$ to $V$ in Fact 2, and showing below that the conditions required in Fact 2 are also satisfied, we can bound $|H(X, Y) - H(X, \hat{Y})|$ by $H(E^C, A_{X,\hat{Y}})$. Hence,

$$\left| H(E \mid X, Y) - H(\hat{E} \mid X, \hat{Y}) \right| = \left| \left( H(E, X, Y) - H(\hat{E}, X, \hat{Y}) \right) + \left( H(X, \hat{Y}) - H(X, Y) \right) \right| \le H(E^C) + H(E^C, A_{X,\hat{Y}}) \le 2H(E^C) + H(A_{X,\hat{Y}}).$$

The detailed reasoning for the two pairs of relations obtained by the above mappings in Fact 2 is as follows.

• $(E, X, Y) \overset{E^C}{\longleftrightarrow} (\hat{E}, X, \hat{Y})$

- The typicalized edit pattern $\hat{E}$ as given in Definition 1 is a deterministic function of $E$ and $X$. Then, given $\hat{E}$ and $X$, one can compute the typicalized PosESS $\hat{Y}$ as noted in Definition 2.

- To show that $(E, X, Y)$ is a deterministic function of $(\hat{E}, X, \hat{Y})$ and $E^C$, we proceed as follows. We first align the '−'s and '$\Delta$'s in $E^C$ with the '$\eta$'s and the '$\Delta$'s in $\hat{E}$. We then obtain $E$ from $\hat{E}$ by changing the '$\eta$'s to '$\Delta$'s where the corresponding symbol in $E^C$ is '$\Delta$', and by inserting the insertion edits '$\iota$' with the corresponding content back where there are '$\iota$'s in $E^C$. The corresponding example is shown in Fig. 11. The intuition is that the original edit pattern $E$ is a "union" of the typicalized edits $\hat{E}$ and the eliminated edits stored in the complement of the typicalized edit pattern, $E^C$. After determining $E$, $Y$ can be determined from $(X, E)$.

• $(X, Y) \overset{(E^C,\, A_{X,\hat{Y}})}{\longleftrightarrow} (X, \hat{Y})$

- With $A_{X,\hat{Y}}$, the $\hat{Y}$-runs can be aligned to their parent run/runs in $X$ without any ambiguity. Indeed, this is the content of Lemma 6. Also, the atypical edits $E^C$ can be aligned to $X$. Then, given the typicalized PosESS $\hat{Y}$ and the atypical edits $E^C$, one can reconstruct $Y$ as follows. If the section of $E^C$ corresponding to an $X$-run-$\hat{Y}$-run match is "empty" (consists only of '−'), then we reconstruct the run/runs of $Y$ to be the same as the run/runs in $\hat{Y}$. For the sections where the atypical edits $E^C$ are nonempty (contain some eliminated insertions '$\iota$' or deletions '$\Delta$'), the corresponding $X$-run undergoes some atypical edits in $E$, which are all eliminated in $\hat{E}$. Hence the corresponding $\hat{Y}$-run is exactly the same as the $X$-run. To reconstruct these atypical runs in $Y$, we only need to apply the eliminated edits specified in $E^C$ back to the corresponding $X$-runs. The corresponding example is shown in Fig. 12.

- Although $(X, Y)$ are in general hard to align, with the aid of $E^C$, the 0-subsequences of $E^C$ correspond to the parts of $X$ with no edit elimination. Hence the corresponding parts of $Y$ remain the same in $\hat{Y}$. The nonzero entries in $E^C$ specify the edit pattern in the $X$-runs where there are edit eliminations. Those $X$-runs undergo no edits in $\hat{Y}$. The alignment $A_{X,Y}$ helps with aligning $E^C$ to the $X$-runs. The corresponding example is shown in Fig. 13.

Fig. 11: Example of recovering $E$ from $\hat{E}$ using $E^C$. Eliminated insertions are restored by inserting $\iota$ back; eliminated deletions are restored by replacing $\eta$ with $\Delta$.

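In $E^C$, there is an elimination of a deletion with probability $\zeta^{\Delta}_j = \delta - \delta(1 - \epsilon - \delta)^{l_{(j)}+1} \le (l_{(j)} + 1)(\epsilon\delta + \delta^2)$, where $l_{(j)}$ is the length of the run where $E_j$ occurs. Averaging over $X$, and denoting the run length random variable by $L$, the probability that a deletion in $E$ is eliminated is $\zeta^{\Delta} = E_L[\zeta^{\Delta}_j] \le (E[L] + 1)(\epsilon\delta + \delta^2)$. Note that $E[L] = 1/\left(1 - \frac{1}{|\mathcal{A}|}\right) \le 2$, where equality holds when $|\mathcal{A}| = 2$. Hence $\zeta^{\Delta} \le 3(\epsilon\delta + \delta^2) \le 6 \max(\epsilon,\delta)^2$.

Similarly, there is an elimination of an insertion in $E^C$ with probability $\zeta^{\iota}_j = \epsilon - \epsilon(1 - \epsilon - \delta)^{l_{(j)}+2} \le (l_{(j)} + 2)(\epsilon\delta + \epsilon^2)$, where $l_{(j)}$ is the length of the run where $E_j$ occurs. Averaging over $X$, and denoting the run length random variable by $L$, the probability that an insertion in $E$ is eliminated is $\zeta^{\iota} = E_L[\zeta^{\iota}_j] \le (E[L] + 2)(\epsilon\delta + \epsilon^2)$. Hence $\zeta^{\iota} \le 4(\epsilon\delta + \epsilon^2) \le 8 \max(\epsilon,\delta)^2$.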
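The two ingredients of the bounds above can be checked numerically. The sketch below is only illustrative: it assumes the source symbols are i.i.d. uniform over the alphabet (as in the RPES model), and it assumes the eliminated-deletion probability has the form $\delta - \delta(1-\epsilon-\delta)^{l+1}$; the function names and parameter values are hypothetical.

```python
import random


def mean_run_length(sequence):
    """Average run length: sequence length divided by the number of runs."""
    runs = 1
    for prev, cur in zip(sequence, sequence[1:]):
        if cur != prev:
            runs += 1
    return len(sequence) / runs


def simulate_mean_run_length(alphabet_size, n, seed=0):
    """Empirical mean run length of n i.i.d. uniform symbols."""
    rng = random.Random(seed)
    x = [rng.randrange(alphabet_size) for _ in range(n)]
    return mean_run_length(x)


def eliminated_deletion_probability(delta, eps, run_length):
    """Assumed form: delta - delta * (1 - eps - delta)^(run_length + 1)."""
    return delta - delta * (1.0 - eps - delta) ** (run_length + 1)


if __name__ == "__main__":
    for size in (2, 4, 16):
        predicted = 1.0 / (1.0 - 1.0 / size)  # E[L] = 1 / (1 - 1/|A|)
        print(size, round(simulate_mean_run_length(size, 200_000), 3), round(predicted, 3))
```

The linear upper bound $(l+1)(\epsilon\delta+\delta^2)$ follows from Bernoulli's inequality $(1-x)^k \ge 1 - kx$ with $x = \epsilon + \delta$, and the simulation illustrates that the mean run length is largest, namely 2, for a binary alphabet.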


Fig. 12: Example of $\hat{Y} \xrightarrow{(E^C,\, A_{X,\hat{Y}})} Y$. With $A_{X,\hat{Y}}$, the alignment of $X$ and $\hat{Y}$ is known. For run-matches where the entries of $E^C$ are all '−', $Y$ is the same as $\hat{Y}$; otherwise, the eliminated edits specified in $E^C$ are applied back to obtain $Y$.


X0001 1 11 2 23233 
E c - - - - A1 4 A - - - - - - - 

Y0010 1 41 232333 



Fig. 13: Example of $Y \xrightarrow{(E^C,\, A_{X,Y})} \hat{Y}$


Recall Definition [3 that E c = (o n+Kl h '. Q_ K ' K1 By similar calculation as Equation [4] in LemmaQ] 

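\begin{align*}
H(O_j) &= H(\zeta^{\Delta}, \zeta^{\iota}, 1 - \zeta^{\Delta} - \zeta^{\iota})\\
&= H(\zeta^{\Delta}) + H(\zeta^{\iota}) - (\log e)\, \zeta^{\Delta} \zeta^{\iota} + O(\max(\zeta^{\Delta}, \zeta^{\iota})^3)\\
&= -\zeta^{\Delta} \log \zeta^{\Delta} - (1 - \zeta^{\Delta}) \log (1 - \zeta^{\Delta}) + H(\zeta^{\iota}) + O(\max(\epsilon,\delta)^4)\\
&= -\zeta^{\Delta} \log \zeta^{\Delta} - (1 - \zeta^{\Delta})(\log e)\left(-\zeta^{\Delta} + O((\zeta^{\Delta})^2)\right) + H(\zeta^{\iota}) + O(\max(\epsilon,\delta)^4)\\
&= -\zeta^{\Delta} \log \zeta^{\Delta} + (\log e)\, \zeta^{\Delta} - \zeta^{\iota} \log \zeta^{\iota} + (\log e)\, \zeta^{\iota} + O(\max(\epsilon,\delta)^4)\\
&\le 12 \max(\epsilon,\delta)^{2-\tau} + 16 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2)\\
&= 28 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2).
\end{align*}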

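Hence,

\begin{align*}
H(E^C) &= H\left(O^{n+K_I-\hat{K}_I},\, C^{K_I-\hat{K}_I}\right)\\
&= H\left(O^{n+K_I-\hat{K}_I}\right) + H\left(C^{K_I-\hat{K}_I} \mid O^{n+K_I-\hat{K}_I}\right)\\
&= \left(n + E[K_I] - E[\hat{K}_I]\right) H(O_j) + H\left(C^{K_I-\hat{K}_I} \mid (K_I - \hat{K}_I)\right)\\
&= \left(n + E[K_I] - E[\hat{K}_I]\right) H(O_j) + \left(E[K_I] - E[\hat{K}_I]\right) \log |\mathcal{A}|\\
&\le \frac{n}{1-\epsilon}\, H(O_j) + \frac{n}{1-\epsilon}\, \zeta^{\iota} \log |\mathcal{A}|\\
&\le \frac{n}{1-\epsilon} \left( 28 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2) + 8 \max(\epsilon,\delta)^2 \log |\mathcal{A}| \right) = \frac{n}{1-\epsilon} \left( 28 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2) \right)
\end{align*}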

where step (a) is by Theorem Q] 

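Hence, $\lim_{n\to\infty} \frac{1}{n} \left| H(E \mid X, Y) - H(\hat{E} \mid X, \hat{Y}) \right| \le \lim_{n\to\infty} \frac{1}{n} \left( 2H(E^C) + H(A_{X,\hat{Y}}) \right) \le 56 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2)$ for any $\tau > 0$. (Recall that in the proof of Lemma 6 we have shown that $H(A_{X,\hat{Y}}) \le O(\max(\epsilon,\delta)^2)\, n$.)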

□ 

Remark: For our purpose of finding a lower bound on the achievable rate, we only need one direction, that is, $\lim_{n\to\infty} \frac{1}{n} \left( H(\hat{E} \mid X, \hat{Y}) - H(E \mid X, Y) \right) \ge -56 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2)$. Lemma 7 gives a stronger statement and will be useful for our ongoing research on insertion-deletion channel capacity.

Theorem 8 below is the main theorem characterizing the information-theoretic lower bound on the optimal rate for the RPES-LtRRID process.

Theorem 8. The optimal average transmission rate for the RPES-LtRRID process satisfies $R^*_{\epsilon,\delta} = \lim_{n\to\infty} \frac{1}{n} H(Y \mid X) \ge H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| - (\delta + \epsilon) C_{|\mathcal{A}|} - 56 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2)$ for any $\tau > 0$, where $C_{|\mathcal{A}|} = \left(1 - \frac{1}{|\mathcal{A}|}\right)^2 \sum_{l=1}^{\infty} \left(\frac{1}{|\mathcal{A}|}\right)^{l-1} l \log l$ is a constant that depends on the alphabet size $|\mathcal{A}|$.

Proof: Combining the preceding lemmas (in particular Lemmas 3, 6 and 7), we have

\begin{align}
\lim_{n\to\infty} \frac{1}{n} H(Y \mid X) &= \lim_{n\to\infty} \frac{1}{n} \left[ H(E \mid X) + H(Y \mid E, X) - H(E \mid X, Y) \right] \tag{42}\\
&= \lim_{n\to\infty} \frac{1}{n} \left[ H(E) - H(E \mid X, Y) \right] \tag{43}\\
&= \lim_{n\to\infty} \frac{1}{n} H(E) - \lim_{n\to\infty} \frac{1}{n} H(\hat{E} \mid X, \hat{Y}) + \lim_{n\to\infty} \frac{1}{n} \left( H(\hat{E} \mid X, \hat{Y}) - H(E \mid X, Y) \right) \tag{44}\\
&\ge H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| + 2 \min(\epsilon,\delta)^{2-\tau} - (\delta + \epsilon) C_{|\mathcal{A}|} - 56 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2) \tag{45}\\
&\ge H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| - (\delta + \epsilon) C_{|\mathcal{A}|} - 56 \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^2) \tag{46}
\end{align}

□

Remark: When $\epsilon = 0$ and $|\mathcal{A}| = 2$, our result matches the result in Corollary IV.5 for the binary deletion channel in [16].

B. APES-AID Process

Given an arbitrary pre-edit source sequence $X \in \mathcal{A}^n$, recall that the $X$-post-edit set $\mathcal{Y}_{\epsilon,\delta}(X)$ denotes the set of all sequences over $\mathcal{A}$ that may be obtained from $X$ via an arbitrary $(\epsilon, \delta)$-InDel process. For zero-error decodability, the encoder needs to send $\log |\mathcal{Y}_{\epsilon,\delta}(X)|$ bits to the decoder. The larger the $X$-post-edit set, the larger the corresponding lower bound on the optimal achievable rate. Hence, to find a "good" lower bound on the optimal achievable rate, one needs to find a pre-edit sequence $X$ with a large $X$-post-edit set.

In two special cases of the edit process, the arbitrary $\epsilon$-insertion process and the arbitrary $\delta$-deletion process, the sizes of the post-edit sets have been well studied in the literature. We present here the results of [19], [20] using our notation. For the arbitrary $\epsilon$-insertion process, the size of the post-edit set, $|\mathcal{Y}_{\epsilon,0}(X)| = \sum_{j=0}^{\epsilon n} \binom{n+\epsilon n}{j} (|\mathcal{A}| - 1)^j \ge \binom{n+\epsilon n}{\epsilon n} (|\mathcal{A}| - 1)^{\epsilon n}$, is independent of the PreESS $X$. For the arbitrary $\delta$-deletion process, the size of the largest post-edit set, $|\mathcal{Y}_{0,\delta}(X)| \ge \sum_{j=0}^{\delta n} \binom{n-\delta n}{j} \ge \binom{n-\delta n}{\delta n}$, depends on the PreESS $X$. In the following, we give examples of PreESSs and the intuition behind the lower bounds for the two special cases.

For an arbitrary $\epsilon$-insertion process, consider a PreESS that we denote $X_a$, which is a single length-$n$ run of the same symbol $a \in \mathcal{A}$. Consider insertion patterns in which, of the $n + \epsilon n$ locations in the PosESS $Y$, exactly $\epsilon n$ locations correspond to insertions of symbols other than $a$. For such a PreESS $X_a$ and such insertion patterns, the resulting PosESSs $Y$ are all distinct. The number of such insertion patterns is $\binom{n+\epsilon n}{\epsilon n} (|\mathcal{A}| - 1)^{\epsilon n}$. Hence, a lower bound on the number of PosESSs $|\mathcal{Y}_{\epsilon,0}(X_a)|$ is $\binom{n+\epsilon n}{\epsilon n} (|\mathcal{A}| - 1)^{\epsilon n}$. The corresponding lower bound on the optimal achievable rate, $\frac{1}{n} \log |\mathcal{Y}_{\epsilon,0}(X_a)|$, is asymptotically $(1 + \epsilon) H\left(\frac{\epsilon}{1+\epsilon}\right) + \epsilon \log (|\mathcal{A}| - 1)$ by Stirling's approximation [24].

For an arbitrary $\delta$-deletion process, consider a PreESS that we denote $X_{\mathrm{Diff}}$, where each symbol is different from the preceding one, i.e., $X_{\mathrm{Diff}}$ consists of $n$ length-1 runs. Consider the set of deletion patterns which delete an arbitrary subset of $\delta n$ non-pairwise-contiguous symbols from $X_{\mathrm{Diff}}$. Note that each such deletion pattern results in a distinct PosESS $Y$. The number of these deletion patterns is at least $\binom{n-\delta n}{\delta n}$. The corresponding lower bound on the optimal achievable rate, $\frac{1}{n} \log |\mathcal{Y}_{0,\delta}(X_{\mathrm{Diff}})|$, is asymptotically $(1 - \delta) H\left(\frac{\delta}{1-\delta}\right)$ by Stirling's approximation [24].
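The Stirling-type limits quoted for the two special cases can be checked numerically at a finite block length. The sketch below is only an illustration under assumed parameter values (block length, $\epsilon$, $\delta$ and alphabet size are arbitrary choices, and the function names are hypothetical); it compares the exact normalized log-counts with the asymptotic expressions $(1+\epsilon)H(\frac{\epsilon}{1+\epsilon}) + \epsilon\log(|\mathcal{A}|-1)$ and $(1-\delta)H(\frac{\delta}{1-\delta})$.

```python
from math import comb, log2


def binary_entropy(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def insertion_rate_finite(n, eps, alphabet_size):
    """(1/n) log2 of C(n + eps*n, eps*n) * (|A| - 1)^(eps*n)."""
    k = round(eps * n)
    return (log2(comb(n + k, k)) + k * log2(alphabet_size - 1)) / n


def insertion_rate_asymptotic(eps, alphabet_size):
    return (1 + eps) * binary_entropy(eps / (1 + eps)) + eps * log2(alphabet_size - 1)


def deletion_rate_finite(n, delta):
    """(1/n) log2 of C(n - delta*n, delta*n)."""
    k = round(delta * n)
    return log2(comb(n - k, k)) / n


def deletion_rate_asymptotic(delta):
    return (1 - delta) * binary_entropy(delta / (1 - delta))


if __name__ == "__main__":
    n, eps, delta, size = 10_000, 0.05, 0.05, 4
    print(insertion_rate_finite(n, eps, size), insertion_rate_asymptotic(eps, size))
    print(deletion_rate_finite(n, delta), deletion_rate_asymptotic(delta))
```

At finite $n$ the exact counts sit slightly below the asymptotic expressions, with a gap of order $(\log n)/n$, which is consistent with the limits being approached from below.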




To the best of our knowledge, there is no literature on bounds for the scenario with both insertions and deletions. In Theorem 9 below, we derive a lower bound on the achievable rate by constructing a PreESS $X_{LB}$ and a subset of InDel patterns such that each InDel pattern in the subset, applied to $X_{LB}$, results in a distinct PosESS $Y$.


Theorem 9. The optimal transmission rate of the APES-AID process satisfies $R^*_{\epsilon,\delta} \ge H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| - \frac{2 \log e}{|\mathcal{A}|} \epsilon - (2 \log e) \max(\epsilon,\delta)^2 + O(\max(\epsilon,\delta)^3) + \epsilon \cdot O\left(\left(\frac{1}{|\mathcal{A}|}\right)^2\right)$.

Proof: Consider a PreESS $X_{LB}$ constructed by alternating two symbols, for example $0101 \ldots 01$. This PreESS has the largest possible number of runs ($n$), and is composed of the fewest possible symbols from the alphabet (2).

We describe a subset of arbitrary $(\epsilon, \delta)$-InDel patterns that results in a "large" $X_{LB}$-post-edit set. In this subset of InDel patterns, we require that all the $\delta n$ deletions precede all the $\epsilon n$ insertions. Next, we require that the deletions, and then the insertions, occur in a "left-to-right manner" (so that a cursor, so to speak, first deletes all the locations to be deleted sequentially from left to right, and then starts from the beginning of the shortened sequence again to insert symbols in an analogous left-to-right manner). Further, the deletions may delete any $\delta n$ non-pairwise-contiguous symbols (if a symbol is deleted, neither of its two neighboring symbols is deleted). Also, each insertion may only insert symbols from $\{2, \ldots, |\mathcal{A}| - 1\}$.

It can be verified that each edit pattern results in a distinct PosESS $Y$, by noting that given $X_{LB}$ and $Y$, one can reconstruct the edit pattern. To do so, one first checks for the "extra" symbols (those in the range $\{2, \ldots, |\mathcal{A}| - 1\}$) to identify the insertion pattern uniquely. Then one takes out those "extra" symbols, aligns the remaining sequence to $X_{LB}$ and checks for the "missing" symbols (from $\{0, 1\}$) to identify the deletion pattern uniquely (this is possible because no pair of neighboring symbols is deleted). The overall InDel pattern is then the left-to-right composition of the deletion pattern and the insertion pattern.
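The distinctness claim can be checked by brute force for a small block length. The following sketch is an illustration only: the block length, the numbers of deletions and insertions, the alphabet size and the function names are all assumptions, not values from the paper. It enumerates every pattern in the restricted family (non-adjacent deletions from the alternating sequence, followed by insertions of symbols other than 0 and 1) and collects the resulting sequences.

```python
from itertools import combinations, product


def nonadjacent_subsets(n, k):
    """All size-k subsets of {0, ..., n-1} with no two consecutive indices."""
    for subset in combinations(range(n), k):
        if all(b - a > 1 for a, b in zip(subset, subset[1:])):
            yield subset


def post_edit_sequences(n, num_del, num_ins, alphabet_size):
    """List the outputs of all restricted InDel patterns applied to 0101...

    Deletions remove non-adjacent symbols; insertions then place symbols
    from {2, ..., alphabet_size - 1} at chosen positions of the output.
    """
    x = [i % 2 for i in range(n)]          # the alternating PreESS
    extra_symbols = range(2, alphabet_size)
    outputs = []
    for deleted in nonadjacent_subsets(n, num_del):
        kept = [s for i, s in enumerate(x) if i not in deleted]
        out_len = len(kept) + num_ins
        for slots in combinations(range(out_len), num_ins):
            for content in product(extra_symbols, repeat=num_ins):
                kept_iter, ins_iter = iter(kept), iter(content)
                y = [next(ins_iter) if pos in slots else next(kept_iter)
                     for pos in range(out_len)]
                outputs.append(tuple(y))
    return outputs


if __name__ == "__main__":
    outs = post_edit_sequences(n=8, num_del=2, num_ins=2, alphabet_size=4)
    print(len(outs), len(set(outs)))  # equal counts mean all outputs are distinct
```

For these small parameters the exact number of non-adjacent deletion subsets is $\binom{n-k+1}{k}$, which is at least the $\binom{n-k}{k}$ used in the counting argument, so the count in the proof remains a valid lower bound.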

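The number of such InDel patterns as described above is $\binom{n-\delta n}{\delta n} \binom{n-\delta n+\epsilon n}{\epsilon n} (|\mathcal{A}| - 2)^{\epsilon n}$, which is hence a lower bound on the number of PosESSs $|\mathcal{Y}_{\epsilon,\delta}(X_{LB})|$. The corresponding lower bound on the optimal achievable rate $R^*_{\epsilon,\delta}$, $\frac{1}{n} \log |\mathcal{Y}_{\epsilon,\delta}(X_{LB})|$, is asymptotically $(1 - \delta) H\left(\frac{\delta}{1-\delta}\right) + (1 - \delta + \epsilon) H\left(\frac{\epsilon}{1-\delta+\epsilon}\right) + \epsilon \log (|\mathcal{A}| - 2)$ by Stirling's approximation [24]. By expanding the binary entropy function and taking the Taylor expansion,

\begin{align}
&(1 - \delta) H\left(\tfrac{\delta}{1-\delta}\right) + (1 - \delta + \epsilon) H\left(\tfrac{\epsilon}{1-\delta+\epsilon}\right) + \epsilon \log (|\mathcal{A}| - 2) \tag{47}\\
&= (1 - \delta)\left( -\tfrac{\delta}{1-\delta} \log \tfrac{\delta}{1-\delta} - \tfrac{1-2\delta}{1-\delta} \log \tfrac{1-2\delta}{1-\delta} \right) + (1 - \delta + \epsilon)\left( -\tfrac{\epsilon}{1-\delta+\epsilon} \log \tfrac{\epsilon}{1-\delta+\epsilon} - \tfrac{1-\delta}{1-\delta+\epsilon} \log \tfrac{1-\delta}{1-\delta+\epsilon} \right) \tag{48}\\
&\quad + \epsilon \log |\mathcal{A}| + \epsilon \log \left(1 - \tfrac{2}{|\mathcal{A}|}\right) \tag{49}\\
&= -\delta \log \tfrac{\delta}{1-\delta} - (1 - 2\delta) \log \tfrac{1-2\delta}{1-\delta} - \epsilon \log \tfrac{\epsilon}{1-\delta+\epsilon} - (1 - \delta) \log \tfrac{1-\delta}{1-\delta+\epsilon} \tag{50}\\
&\quad + \epsilon \log |\mathcal{A}| + \epsilon \log \left(1 - \tfrac{2}{|\mathcal{A}|}\right) \tag{51}\\
&= -\delta \log \delta - (1 - 2\delta) \log (1 - 2\delta) + (1 - \delta) \log (1 - \delta) - \epsilon \log \epsilon - (1 - \delta) \log (1 - \delta) + (1 - \delta + \epsilon) \log (1 - \delta + \epsilon) \tag{52}\\
&\quad + \epsilon \log |\mathcal{A}| + \epsilon \log \left(1 - \tfrac{2}{|\mathcal{A}|}\right) \tag{53}\\
&= H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| - (1 - 2\delta) \log (1 - 2\delta) + (1 - \delta) \log (1 - \delta) + (1 - \epsilon) \log (1 - \epsilon) \tag{54}\\
&\quad + (1 - \delta + \epsilon) \log (1 - \delta + \epsilon) + \epsilon \log \left(1 - \tfrac{2}{|\mathcal{A}|}\right) \tag{55}\\
&= H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| - (1 - 2\delta)(\log e)\left(-2\delta - \tfrac{(2\delta)^2}{2} - O(\delta^3)\right) + (1 - \delta)(\log e)\left(-\delta - \tfrac{\delta^2}{2} - O(\delta^3)\right) \notag\\
&\quad + (1 - \epsilon)(\log e)\left(-\epsilon - \tfrac{\epsilon^2}{2} - O(\epsilon^3)\right) + (1 - \delta + \epsilon)(\log e)\left[-(\delta - \epsilon) - \tfrac{(\delta-\epsilon)^2}{2} - O((\delta - \epsilon)^3)\right] \notag\\
&\quad + \epsilon (\log e)\left[-\tfrac{2}{|\mathcal{A}|} - \tfrac{1}{2}\left(\tfrac{2}{|\mathcal{A}|}\right)^2 - O\left(\left(\tfrac{1}{|\mathcal{A}|}\right)^3\right)\right] \tag{56}\\
&= H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| + (\log e)\left(\epsilon^2 - \delta^2 - \epsilon\delta - \epsilon \tfrac{2}{|\mathcal{A}|}\right) + O(\max(\epsilon,\delta)^3) + \epsilon \cdot O\left(\left(\tfrac{1}{|\mathcal{A}|}\right)^2\right) \tag{57}\\
&\ge H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| - \tfrac{2 \log e}{|\mathcal{A}|} \epsilon - (2 \log e) \max(\epsilon,\delta)^2 + O(\max(\epsilon,\delta)^3) + \epsilon \cdot O\left(\left(\tfrac{1}{|\mathcal{A}|}\right)^2\right) \tag{58}
\end{align}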
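The second-order expansion and the final lower bound can be sanity-checked numerically. The sketch below is illustrative only: the parameter values and function names are assumptions. It evaluates the exact asymptotic rate expression, the second-order expansion with the $(\log e)(\epsilon^2 - \delta^2 - \epsilon\delta - 2\epsilon/|\mathcal{A}|)$ correction, and the bound stated in Theorem 9 with the higher-order terms dropped.

```python
from math import e, log2


def binary_entropy(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def exact_rate(eps, delta, alphabet_size):
    """Asymptotic rate from the counting argument, before any expansion."""
    return ((1 - delta) * binary_entropy(delta / (1 - delta))
            + (1 - delta + eps) * binary_entropy(eps / (1 - delta + eps))
            + eps * log2(alphabet_size - 2))


def second_order_expansion(eps, delta, alphabet_size):
    """Expansion up to second order, dropping the cubic and 1/|A|^2 terms."""
    correction = eps ** 2 - delta ** 2 - eps * delta - 2 * eps / alphabet_size
    return (binary_entropy(delta) + binary_entropy(eps)
            + eps * log2(alphabet_size) + log2(e) * correction)


def theorem9_bound(eps, delta, alphabet_size):
    """Lower bound of Theorem 9 with the higher-order terms dropped."""
    return (binary_entropy(delta) + binary_entropy(eps)
            + eps * log2(alphabet_size)
            - 2 * log2(e) / alphabet_size * eps
            - 2 * log2(e) * max(eps, delta) ** 2)


if __name__ == "__main__":
    args = (0.01, 0.01, 256)
    print(exact_rate(*args), second_order_expansion(*args), theorem9_bound(*args))
```

For small $\epsilon$ and $\delta$ the first two values agree up to third-order terms, and both stay above the Theorem 9 bound, which is what the chain of (in)equalities asserts.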


IV. Algorithm and Performance 

We propose a unified coding scheme for both APES-AID and RPES-LtRRID processes. The coding scheme is a combination of dynamic programming (DP) and entropy coding. Note that using DP to find the edit distance between two sequences is well-known in the literature; the contribution here is to demonstrate that for a "large" alphabet and a "small" number of edits, this algorithmic procedure results in an expected description length that matches the information-theoretic lower bounds up to lower order terms. Coding schemes achieving rates that match the lower bounds in Theorem 9 and Theorem 8 for all alphabet sizes are an ongoing direction.

A. Algorithm 

For this section on a unified algorithm for both APES-AID and RPES-LtRRID processes, we unify the notation by using the notation without bars.

The encoder $\phi_n$ takes the following inputs: the PreESS $X$ and the PosESS $Y$, and outputs a transmission $T$ as follows:
Step 1 DP-enc: The first subroutine of the encoder runs a dynamic program on the input $(X, Y)$ to output an edit pattern $\hat{E}$ with $\hat{\epsilon} n$ insertions and $\hat{\delta} n$ deletions. This edit pattern $\hat{E}$ satisfies the condition that $(\hat{\epsilon} + \hat{\delta}) n$ is the minimum number of edits needed to convert $X$ to $Y$. "Standard" edit-distance algorithms typically run in time that is quadratic in $n$, the length of the strings being compared. We reference here Ukkonen's work [25] since it gives an algorithm that runs in time $O(nk)$, where $k$ is the edit distance (the minimum number of edits that must be applied to $X$ to get $Y$), and is hence faster.

Step 2 Repre-enc: Represent the edit pattern $\hat{E}$ as a pair of sequences $(O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n})$, where the edit operation pattern $O^{n+\hat{\epsilon} n} \in \{\iota, \Delta, \eta\}^{n+\hat{\epsilon} n}$ specifies the edit operations of the edit pattern output by DP, and the insertion content pattern $C^{\hat{\epsilon} n} \in \mathcal{A}^{\hat{\epsilon} n}$ specifies the content of the insertions of the edit pattern output by DP.

Step 3 Entro-enc: The encoder uses a Lempel-Ziv entropy code to compress $O^{n+\hat{\epsilon} n}$ and $C^{\hat{\epsilon} n}$.

The output of the encoder is a composition of the above three steps, $\mathrm{Enc}(X, Y) = \mathrm{Entro}(\mathrm{Repre}(\mathrm{DP}(X, Y)))$.

The decoder decodes $O^{n+\hat{\epsilon} n}$ and $C^{\hat{\epsilon} n}$ using the entropy decoder corresponding to the entropy encoder in Step 3, and reconstructs $Y$ from $(X, O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n})$.
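The three encoder steps and the decoder can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain quadratic-time insertion/deletion DP rather than Ukkonen's $O(nk)$ algorithm, it uses `zlib` (DEFLATE, an LZ77-based compressor) as a stand-in for the Lempel-Ziv entropy code, it represents sequences as Python strings, and the letters `n`, `d`, `i` are hypothetical stand-ins for $\eta$, $\Delta$, $\iota$.

```python
import zlib

NOOP, DELETE, INSERT = "n", "d", "i"  # stand-ins for eta, Delta, iota


def dp_edit_pattern(x, y):
    """Step 1 and 2: minimum InDel edit pattern as (operations, insert contents)."""
    n, m = len(x), len(y)
    # dist[i][j] = minimum number of insertions/deletions turning x[i:] into y[j:]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dist[i][j] = m - j
            elif j == m:
                dist[i][j] = n - i
            elif x[i] == y[j]:
                dist[i][j] = dist[i + 1][j + 1]
            else:
                dist[i][j] = 1 + min(dist[i + 1][j], dist[i][j + 1])
    ops, content = [], []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and x[i] == y[j]:
            ops.append(NOOP)
            i += 1
            j += 1
        elif j == m or (i < n and dist[i + 1][j] <= dist[i][j + 1]):
            ops.append(DELETE)
            i += 1
        else:
            ops.append(INSERT)
            content.append(y[j])
            j += 1
    return "".join(ops), "".join(content)


def encode(x, y):
    """Step 3: compress the operation pattern and the insertion contents."""
    ops, content = dp_edit_pattern(x, y)
    return zlib.compress(ops.encode("ascii")), zlib.compress(content.encode("utf-8"))


def decode(x, transmission):
    """Decoder: rebuild y from x and the two decompressed sequences."""
    ops = zlib.decompress(transmission[0]).decode("ascii")
    content = iter(zlib.decompress(transmission[1]).decode("utf-8"))
    source = iter(x)
    y = []
    for op in ops:
        if op == INSERT:
            y.append(next(content))
        else:
            symbol = next(source)
            if op == NOOP:
                y.append(symbol)
    return "".join(y)


if __name__ == "__main__":
    x, y = "0101010101", "0120100101"
    assert decode(x, encode(x, y)) == y
```

The operation string has one entry per source symbol plus one per insertion, matching the length $n + \hat{\epsilon} n$ of the edit operation pattern. For short inputs the `zlib` header overhead dominates, so any compression benefit only shows for long, lightly edited sequences.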

B. Performance 

It is well known in the literature that dynamic programming finds the edit distance between two sequences, i.e., the minimal total number of edits (insertions, deletions and substitutions) needed to convert one sequence to the other. In our model with only insertions and deletions, it is straightforward to further deduce that the number of insertions and the number of deletions output by DP are both minimized, for the following reason. For all the edit patterns that convert $X$ to $Y$, the number of insertions ($K_I$) and the number of deletions ($K_D$) are subject to the constraint $K_D - K_I = |X| - |Y|$, where the lengths $|X|$ and $|Y|$ of the two source sequences are fixed given the two sequences. Hence, minimizing $K_D + K_I$ over all the edit patterns that convert $X$ to $Y$ minimizes both $K_D$ and $K_I$. For the proofs of Theorems 10 and 11 we only need a looser statement, which is stated in the following Fact 3.

Fact 3. The number of insertions (respectively the number of deletions) of the edit pattern output by dynamic programming, $\hat{\epsilon} n$ (respectively $\hat{\delta} n$), is always no larger than the number of insertions of the actual edit pattern (respectively the number of deletions of the actual edit pattern). Hence, for the arbitrary $(\epsilon, \delta)$-InDel process,

$$\hat{\epsilon} \le \epsilon, \qquad \hat{\delta} \le \delta. \tag{59}$$

In the limit as the block length $n$ goes to infinity, the compression rate of the above algorithm is $\lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n})$. In the following we characterize upper bounds on the compression rate of the algorithm for both the RPES-LtRRID process and the APES-AID process.

1) Performance for RPES-LtRRID Process: In the RPES-LtRRID process, the numbers of deletions and insertions may exceed their expectations $\frac{\delta}{1-\epsilon} n$ and $\frac{\epsilon}{1-\epsilon}(n + 1)$ respectively, in which case more bits may need to be transmitted. Moreover, the number of insertions can be unbounded. In Theorem 10 below, we show that these events contribute a negligible amount to the achievable rate as the block length $n$ tends to infinity. We use the Chernoff bound to show that the probability that the number of insertions/deletions is "much more" than its expectation is exponentially small in the block length $n$, while the amount contributed to the rate is polynomial in the block length $n$.

Theorem 10. The algorithm achieves a rate of at most $H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| + (\log |\mathcal{A}| + \log e - 2) \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^3)$ for any $\tau > 0$ for the RPES-LtRRID process.

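Sketch proof: The number of deletions $K_D$ is the sum of $n$ i.i.d. Bernoulli($\frac{\delta}{1-\epsilon}$) random variables. Hence by the Chernoff bound, $\Pr\left(K_D > (1 + n^{-1/4}) \frac{\delta}{1-\epsilon} n\right) \le e^{-\frac{\delta}{3(1-\epsilon)} \sqrt{n}}$. Similarly, the number of insertions $K_I$ is the sum of $n + 1$ i.i.d. Geo($1 - \epsilon$) random variables. Hence by the Chernoff bound, $\Pr\left(K_I > (1 + n^{-1/4}) \frac{\epsilon}{1-\epsilon} (n + 1)\right) \le e^{-\frac{\epsilon}{3(1-\epsilon)} \left(\sqrt{n} + \frac{1}{\sqrt{n}}\right)}$. Hence, with probability at least $1 - e^{-\frac{\delta}{3(1-\epsilon)} \sqrt{n}} - e^{-\frac{\epsilon}{3(1-\epsilon)} \left(\sqrt{n} + \frac{1}{\sqrt{n}}\right)}$, by Fact 3, $\hat{\delta} \le \frac{\delta}{1-\epsilon}(1 + n^{-1/4})$ and $\hat{\epsilon} \le \frac{\epsilon}{1-\epsilon}(1 + n^{-1/4})(1 + n^{-1})$. By the calculation in the Appendix, the information rate this event contributes to $\lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n})$ is at most $H\left(\frac{\delta}{1-\epsilon}\right) + H\left(\frac{\epsilon}{1-\epsilon}\right) + \frac{\epsilon}{1-\epsilon} \log |\mathcal{A}| + (\log e)\left(\frac{\epsilon}{1-\epsilon}\right)^2 + O\left(\left(\frac{\epsilon}{1-\epsilon}\right)^4\right) = H\left(\frac{\delta}{1-\epsilon}\right) + H\left(\frac{\epsilon}{1-\epsilon}\right) + \frac{\epsilon}{1-\epsilon} \log |\mathcal{A}| + (\log e) \epsilon^2 + O(\epsilon^3)$.

With probability at most $e^{-\frac{\delta}{3(1-\epsilon)} \sqrt{n}} + e^{-\frac{\epsilon}{3(1-\epsilon)} \left(\sqrt{n} + \frac{1}{\sqrt{n}}\right)}$, $K_D \in \left[(1 + n^{-1/4}) \frac{\delta}{1-\epsilon} n,\, n\right]$ and $K_I \in \left[(1 + n^{-1/4}) \frac{\epsilon}{1-\epsilon} (n + 1),\, n\right]$. The number of bits needed to specify the edit pattern is linear in $n$ (bounded from above by $2n + n \log |\mathcal{A}|$). However, the probability of this event is exponentially small in $n$. Hence, as the block length $n$ goes to infinity, the information this event contributes to $\lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n})$ goes to zero.

The number of deletions $K_D$ cannot exceed $n$, whereas the number of insertions $K_I$ can be unbounded. When $K_I$ is larger than $n$ but still linear in $n$ ($K_I = O(n)$), the number of bits needed to specify the edit pattern is linear in $n$, whereas the probability of this event is exponentially small in $n$. Similarly, when $K_I = \omega(n)$, the number of bits needed to specify the edit pattern is linear in $K_I$ and the probability of this event is exponentially small in $K_I$. Hence, the information rate contributed to $\lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n})$ when $K_I$ exceeds $n$ goes to zero as $n$ goes to infinity.

From the above analysis, averaging over the randomness of the edit process, $\lim_{n\to\infty} \frac{1}{n} H(O^{n+K_I}, C^{K_I}) \le H\left(\frac{\delta}{1-\epsilon}\right) + H\left(\frac{\epsilon}{1-\epsilon}\right) + \frac{\epsilon}{1-\epsilon} \log |\mathcal{A}| + (\log e) \epsilon^2 + O(\epsilon^3)$. By Taylor expansion and the calculations below, the rate achieved by the algorithm is upper bounded by $H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| + (\log |\mathcal{A}| + \log e - 2) \max(\epsilon,\delta)^{2-\tau} + O(\max(\epsilon,\delta)^3)$.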





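\begin{align}
H\left(\tfrac{\delta}{1-\epsilon}\right) &= -\tfrac{\delta}{1-\epsilon} \log \tfrac{\delta}{1-\epsilon} - \tfrac{1-\epsilon-\delta}{1-\epsilon} \log \tfrac{1-\epsilon-\delta}{1-\epsilon} \tag{60}\\
&= -\tfrac{\delta}{1-\epsilon} \log \delta - \tfrac{1-\epsilon-\delta}{1-\epsilon} \log (1 - \epsilon - \delta) + \log (1 - \epsilon) \tag{61}\\
&= -\delta (1 + \epsilon + O(\epsilon^2)) \log \delta - (1 - \epsilon - \delta)(1 + \epsilon + O(\epsilon^2)) \log (1 - \epsilon - \delta) + \log (1 - \epsilon) \tag{62}\\
&= [-\delta \log \delta - (1 - \delta) \log (1 - \delta)] - \delta (\epsilon + O(\epsilon^2)) \log \delta - (1 - \delta + O(\max(\epsilon,\delta)^2)) \log (1 - \epsilon - \delta) \tag{63}\\
&\quad + \log (1 - \epsilon) + (1 - \delta) \log (1 - \delta) \tag{64}\\
&= H(\delta) - \epsilon\delta \log \delta + (1 - \delta + O(\max(\epsilon,\delta)^2))(\log e)\left(\epsilon + \delta + (\epsilon + \delta)^2/2 + O((\epsilon + \delta)^3)\right) \tag{65}\\
&\quad - (\log e)\left(\epsilon + \epsilon^2/2 + O(\epsilon^3)\right) - (1 - \delta)(\log e)\left(\delta + \delta^2/2 + O(\delta^3)\right) \tag{66}\\
&= H(\delta) - \epsilon\delta \log \delta + (\log e) \cdot \left[\epsilon + \delta + \epsilon^2/2 - \delta^2/2 - \epsilon - \epsilon^2/2 - \delta + \delta^2/2\right] + O(\max(\epsilon,\delta)^3) \tag{67}\\
&= H(\delta) - \epsilon \delta^{1-\tau} + O(\max(\epsilon,\delta)^3) \tag{68}
\end{align}

\begin{align}
H\left(\tfrac{\epsilon}{1-\epsilon}\right) &= -\tfrac{\epsilon}{1-\epsilon} \log \tfrac{\epsilon}{1-\epsilon} - \tfrac{1-2\epsilon}{1-\epsilon} \log \tfrac{1-2\epsilon}{1-\epsilon} \tag{69}\\
&= -\tfrac{\epsilon}{1-\epsilon} \log \epsilon - \tfrac{1-2\epsilon}{1-\epsilon} \log (1 - 2\epsilon) + \log (1 - \epsilon) \tag{70}\\
&= -\epsilon (1 + \epsilon + O(\epsilon^2)) \log \epsilon - (1 - 2\epsilon)(1 + \epsilon + O(\epsilon^2)) \log (1 - 2\epsilon) + \log (1 - \epsilon) \tag{71}\\
&= [-\epsilon \log \epsilon - (1 - \epsilon) \log (1 - \epsilon)] - \epsilon (\epsilon + O(\epsilon^2)) \log \epsilon - (1 - \epsilon + O(\epsilon^2)) \log (1 - 2\epsilon) + (2 - \epsilon) \log (1 - \epsilon) \tag{72}\\
&= H(\epsilon) - \epsilon^2 \log \epsilon - (1 - \epsilon + O(\epsilon^2))(\log e)\left(-2\epsilon - (2\epsilon)^2/2 + O(\epsilon^3)\right) + (2 - \epsilon)(\log e)\left(-\epsilon - \epsilon^2/2 + O(\epsilon^3)\right) \tag{73}\\
&= H(\epsilon) - \epsilon^{2-\tau} + O(\epsilon^3) \tag{74}
\end{align}

\begin{align}
\tfrac{\epsilon}{1-\epsilon} \log |\mathcal{A}| &= \epsilon (1 + \epsilon + O(\epsilon^2)) \log |\mathcal{A}| \tag{75}\\
&= \epsilon \log |\mathcal{A}| + (\log |\mathcal{A}|) \epsilon^2 + O(\epsilon^3) \tag{76}
\end{align}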


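2) Performance for APES-AID Process:

Theorem 11. The algorithm achieves a rate of at most $H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| + (\log e) \epsilon^2 + O(\epsilon^4)$ for the APES-AID process.


Proof: The asymptotic compression rate of the algorithm in Section IV-A is $\lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}, C^{\hat{\epsilon} n}) = \lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}) + \lim_{n\to\infty} \frac{1}{n} H(C^{\hat{\epsilon} n})$ (the contents of the insertions are independent of the positions of the edit operations). The empirical entropy of $O^{n+\hat{\epsilon} n}$ can be calculated (see the Appendix), hence $\lim_{n\to\infty} \frac{1}{n} H(O^{n+\hat{\epsilon} n}) = H(\hat{\delta}) + H(\hat{\epsilon}) + (\log e) \hat{\epsilon}^2 + O(\hat{\epsilon}^4)$. The contents of the insertions are uniformly drawn from $\mathcal{A}$, hence $\lim_{n\to\infty} \frac{1}{n} H(C^{\hat{\epsilon} n}) = \lim_{n\to\infty} \frac{1}{n} \hat{\epsilon} n \log |\mathcal{A}| = \hat{\epsilon} \log |\mathcal{A}|$. So the compression rate of the algorithm for the APES-AID process is at most $H(\hat{\delta}) + H(\hat{\epsilon}) + \hat{\epsilon} \log |\mathcal{A}| + (\log e) \hat{\epsilon}^2 + O(\hat{\epsilon}^4)$. By Fact 3, an upper bound on the compression rate is $H(\delta) + H(\epsilon) + \epsilon \log |\mathcal{A}| + (\log e) \epsilon^2 + O(\epsilon^4)$. ■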
Appendix A

Different Stochastic InDel Processes 

There are potentially many ways to model a stochastic InDel process. In this paper, we study a left-to-right random InDel 
process modeled as a three-state Markov chain as shown in Fig. [I] It is a memoryless (i.i.d.) random InDel model. A more 
general left-to-right random InDel process with memory is shown in Fig. [2] More details are discussed in Section lll-B 1 1 The 
model was also studied in [8] as a channel with synchronization errors. The authors imposed a maximum insertion length, and required the insertion and deletion probabilities to be equal so that the expected length of the output sequence is the same as that of the input sequence. These two requirements are not needed in our paper. The authors of [8] proposed a block code which is a concatenation of a "watermark" code and an LDPC code for this synchronization error channel, and presented the empirical performance of their code.

Another model (possibly more realistic for human editing behavior) embeds the randomness of the “cursor”
jumping back and forth. This InDel process can also be modeled as a three-state Markov chain. Fig. 14 shows a special case
with a “uniform cursor jump”: at each iteration, the cursor jumps to a position that is uniformly distributed in the
current sequence, and then either deletes the symbol in front of it with probability $p_d$, or inserts a symbol drawn
uniformly from the alphabet $\mathcal{A}$ with probability $p_i = 1 - p_d$. We believe our approach will yield similar
results for this model, because the probability of an insertion-deletion interaction is of order $O(\epsilon\delta)$, which
contributes only to lower-order terms. Such a model typically generates “sparse isolated edits”. A more sophisticated
stochastic model, better representing “realistic” edit scenarios, would have a distribution on the cursor jump and also a
distribution on the run-length of insertions and deletions; this is the subject of ongoing investigation.
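
The uniform-cursor-jump process described above can be simulated in a few lines. The following Python sketch is only an illustration under stated assumptions: the function name, the `num_edits` parameter, and the convention that the cursor sits in one of the `len + 1` gaps (so that "the symbol in front" is the one immediately to its right, and a deletion at the very end does nothing) are ours and are not specified in the text.

```python
import random

def uniform_cursor_jump_indel(seq, alphabet, p_d, num_edits, rng=None):
    """Apply num_edits edits, each at a uniformly random cursor position.

    Assumed conventions (not fixed by the paper): the cursor is a gap index
    in 0..len(out); a deletion removes the symbol right after the cursor and
    is a no-op when the cursor is at the end of the sequence.
    """
    rng = rng or random.Random()
    out = list(seq)
    for _ in range(num_edits):
        cursor = rng.randint(0, len(out))      # uniform over all gaps
        if rng.random() < p_d:                 # deletion with probability p_d
            if cursor < len(out):
                del out[cursor]
        else:                                  # insertion with probability 1 - p_d
            out.insert(cursor, rng.choice(alphabet))
    return out
```

For a long sequence and a small number of edits, two edits rarely land next to each other, which is the "sparse isolated edits" behaviour mentioned above.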



Since an insertion process can be regarded as the inverse of a deletion process, a random InDel process as in Fig. 15
was studied in [10]. The authors of [10] also considered substitution as an edit operation; here we hide the part
corresponding to the substitution process so as to represent just the InDel process. In Fig. 15, an auxiliary sequence
$Z \in \mathcal{A}^n$ is a length-$n$ sequence of symbols drawn i.i.d. uniformly at random from the source alphabet
$\mathcal{A}$. Sequences $X$ and $Y$ are generated from $Z$ through two i.i.d. deletion processes with deletion
probabilities $p_i$ and $p_d$ respectively. Hence, $X$ is a variable-length (Binomial$(n, 1 - p_i)$) sequence of i.i.d.
symbols from $\mathcal{A}$. The authors of [10] proposed an algorithm which is asymptotically optimal for small insertion
and deletion probabilities. More specifically, their algorithm is $O(\max(p_i, p_d)^{2-\tau})$ away from the optimal rate
$\lim_{n\to\infty} \frac{1}{n} H(Y|X)$.$^{10}$ However, they did not derive an explicit expression for the term
$\frac{1}{n} H(Y|X)$ for the InDel process,$^{11}$ whereas one of our main efforts is to characterize the explicit
expression of the optimal rate.
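
As an illustration of the construction just described, the pair $(X, Y)$ can be sampled by passing a common sequence $Z$ through two independent deletion processes. This is a minimal Python sketch; the function names `delete_iid` and `sample_xy` are hypothetical and chosen only for this example.

```python
import random

def delete_iid(seq, p, rng):
    """Delete each symbol independently with probability p."""
    return [s for s in seq if rng.random() >= p]

def sample_xy(n, alphabet, p_i, p_d, rng=None):
    """Draw Z i.i.d. uniform from the alphabet, then X and Y by independent deletions.

    A symbol of Z that survives in Y but not in X looks like an insertion
    when going from X to Y, which is why X is deleted with probability p_i.
    """
    rng = rng or random.Random()
    z = [rng.choice(alphabet) for _ in range(n)]
    x = delete_iid(z, p_i, rng)   # length is Binomial(n, 1 - p_i)
    y = delete_iid(z, p_d, rng)   # length is Binomial(n, 1 - p_d)
    return z, x, y
```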



Fig. 15: other stochastic model 2 

There are also many different stochastic insertion/deletion models in the line of work on insertion/deletion channels.
A random InDel model in which each source bit/symbol is deleted with probability $p_d$, or has an extra bit/symbol inserted
after it with probability $p_i$, or is transmitted/kept (with no deletion or insertion after it) with probability
$1 - p_d - p_i$, was studied in both [13] and [26]. In [26], capacity lower bounds for channels modeled as this InDel
process are derived. In [13], an algorithm for two-way file synchronization under a non-binary, non-uniform source alphabet
was proposed. The Gallager model [27], also studied in [28], is an InDel channel where each transmitted bit independently
gets deleted with probability $p_d$ or replaced with two random bits with probability $p_i$.
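
The two per-symbol channel models above differ only in what happens on an "insertion" event, which a short simulation makes concrete. The sketch below is hedged: the function name and the `gallager` flag are hypothetical, and drawing the inserted symbol uniformly from the alphabet is our assumption, since the text does not state its distribution for the first model.

```python
import random

def per_symbol_indel(seq, alphabet, p_d, p_i, rng=None, gallager=False):
    """Pass seq through a per-symbol InDel channel.

    Each symbol is deleted w.p. p_d, triggers an insertion event w.p. p_i,
    and is kept unchanged otherwise. On an insertion event the symbol is
    either followed by one extra random symbol (default) or replaced by
    two random symbols (gallager=True).
    """
    rng = rng or random.Random()
    out = []
    for s in seq:
        u = rng.random()
        if u < p_d:
            continue                                   # symbol deleted
        if u < p_d + p_i:
            if gallager:
                out.extend([rng.choice(alphabet), rng.choice(alphabet)])
            else:
                out.extend([s, rng.choice(alphabet)])  # keep s, insert after it
        else:
            out.append(s)                              # transmitted unchanged
    return out
```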

10 In contrast to [10], in our paper we use X for the side-information and Y for the sequence to be synchronized.

11 For the case with only deletions, the authors do have an information-theoretic lower bound in their earlier work [11].
Appendix B
Proof of Fact 3

We adopt the following notation in this proof: 

1. Given a sequence, a newly inserted symbol is written with a superscript $\iota$ ($a^{\iota}$).

2. Given a string, a deleted symbol is not actually deleted, but instead is written with a subscript $\Delta$ ($a_{\Delta}$).

Note that with this notation, the scenario of deleting an inserted symbol is represented as $a^{\iota}_{\Delta}$; the scenario of
inserting a deleted symbol is represented as $a_{\Delta} a^{\iota}$.

Take PreESS $X$ and perform the arbitrary $(\epsilon, \delta)$-InDel process, to obtain a string of length
$m \le n + \epsilon n$ of which at most $\delta n$ symbols have a $\Delta$-subscript, and at most $\epsilon n$ symbols
have an $\iota$-superscript.

We can discard symbols which have both a $\Delta$-subscript and an $\iota$-superscript ($a^{\iota}_{\Delta}$), and treat those
as if they were never inserted in the first place. Since the symbols with only a $\Delta$-subscript are those found in the
PreESS $X$, it is obvious we can perform all the deletions first (an arbitrary $\delta$-deletion process), and then all the
insertions (an arbitrary $\frac{\epsilon}{1-\delta}$-insertion process, because the ratio of the number of insertions to the
length of the sequence after the deletions can be at most $\frac{\epsilon}{1-\delta}$) to obtain the exact same sequence.
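
The marking argument can be checked mechanically on a small example. The following Python sketch is an illustration only: the edit format (`("ins", pos, sym)` and `("del", pos)`, with positions referring to the currently visible string) and all function names are assumptions made for this example. It tracks the two marks for every symbol, discards doubly marked symbols, and confirms that "deletions first, then insertions" reproduces the directly edited string.

```python
def apply_plain(x, edits):
    """Apply the edits directly, in the given order."""
    y = list(x)
    for op in edits:
        if op[0] == "del":
            del y[op[1]]
        else:
            y.insert(op[1], op[2])
    return y

def apply_marked(x, edits):
    """Apply the edits, but mark symbols instead of removing them."""
    marked = [[s, False, False] for s in x]    # [symbol, inserted?, deleted?]
    for op in edits:
        visible = [i for i, m in enumerate(marked) if not m[2]]
        if op[0] == "del":
            marked[visible[op[1]]][2] = True   # add the Delta-subscript
        else:
            at = visible[op[1]] if op[1] < len(visible) else len(marked)
            marked.insert(at, [op[2], True, False])   # add the iota-superscript
    return marked

def two_stage(marked):
    """Discard doubly marked symbols, then split into deletions and insertions."""
    kept = [m for m in marked if not (m[1] and m[2])]
    after_deletions = [m[0] for m in kept if not m[1] and not m[2]]
    final = [m[0] for m in kept if not m[2]]
    num_del = sum(1 for m in kept if m[2])
    num_ins = sum(1 for m in kept if m[1])
    return after_deletions, final, num_del, num_ins

edits = [("ins", 1, "X"), ("del", 1), ("del", 2), ("ins", 0, "Y")]
after_deletions, final, num_del, num_ins = two_stage(apply_marked("abcd", edits))
assert final == apply_plain("abcd", edits)
```

In this example the inserted-then-deleted symbol `X` is discarded, so the four edits reduce to one deletion followed by one insertion.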


Appendix C 

Entropy encoding rate of $O^{n+\epsilon n}$

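The entropy encoder Entro-enc encodes $O^{n+\epsilon n}$ at the empirical entropy. The empirical distribution of
$\{\iota, \Delta, \eta\}$ in $O^{n+\epsilon n}$ is

$$p_{\eta} = \frac{1-\delta}{1+\epsilon}, \qquad p_{\iota} = \frac{\epsilon}{1+\epsilon}, \qquad p_{\Delta} = \frac{\delta}{1+\epsilon}.$$

The empirical entropy of the symbols $\{\iota, \Delta, \eta\}$ in $O^{n+\epsilon n}$ is

$$
\begin{aligned}
\lim_{n\to\infty} \frac{1}{(1+\epsilon)n} H(O^{n+\epsilon n})
&= -\frac{1-\delta}{1+\epsilon}\log\frac{1-\delta}{1+\epsilon} - \frac{\epsilon}{1+\epsilon}\log\frac{\epsilon}{1+\epsilon} - \frac{\delta}{1+\epsilon}\log\frac{\delta}{1+\epsilon} \\
&= \frac{1}{1+\epsilon}\left[H(\delta) + H(\epsilon) + (1-\epsilon)\log(1-\epsilon) + (1+\epsilon)\log(1+\epsilon)\right] \\
&\overset{(a)}{=} \frac{1}{1+\epsilon}\left[H(\delta) + H(\epsilon) + (1-\epsilon)(\log e)\left(-\epsilon - \frac{\epsilon^2}{2} - \frac{\epsilon^3}{3} + O(\epsilon^4)\right) + (1+\epsilon)(\log e)\left(\epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} + O(\epsilon^4)\right)\right] \\
&= \frac{1}{1+\epsilon}\left[H(\delta) + H(\epsilon) + (\log e)\epsilon^2 + O(\epsilon^4)\right],
\end{aligned}
$$

where step (a) is by Taylor expansion.

Hence,

$$\lim_{n\to\infty} \frac{1}{n} H(O^{n+\epsilon n}) = H(\delta) + H(\epsilon) + (\log e)\epsilon^2 + O(\epsilon^4).$$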
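
The expansion can be sanity-checked numerically. The following Python sketch (function names are ours, logarithms are taken base 2, and the values of $\epsilon$ and $\delta$ are arbitrary) compares the exact per-symbol rate, computed from the empirical distribution, with the approximation $H(\delta) + H(\epsilon) + (\log e)\epsilon^2$.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def exact_rate(eps, delta):
    """(1 + eps) times the entropy of the empirical distribution, in bits per pre-edit symbol."""
    probs = [eps / (1 + eps), delta / (1 + eps), (1 - delta) / (1 + eps)]
    return -(1 + eps) * sum(p * math.log2(p) for p in probs if p > 0)

def approx_rate(eps, delta):
    """Leading terms of the expansion: H(delta) + H(eps) + (log e) * eps^2."""
    return h2(delta) + h2(eps) + math.log2(math.e) * eps ** 2

for eps in (0.01, 0.05, 0.1):
    gap = exact_rate(eps, 0.03) - approx_rate(eps, 0.03)
    print(f"eps={eps}: gap={gap:.3e}, eps^4={eps ** 4:.3e}")
```

The printed gap should shrink like $\epsilon^4$, consistent with the $O(\epsilon^4)$ remainder; the odd-order terms in $\epsilon$ cancel between the two Taylor series.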